Contents


The latest version of this document can be found on the Marioni Lab GitHub page

library(ggplot2)
library(reshape2)
library(parallel)
library(Matrix)
library(pheatmap)
library(RColorBrewer)
library(reshape2)
library(scales)
library(ggsignif)
library(viridis)
library(quadprog)
library(BiocStyle)
library(DropletUtils)
library(here)
library(knitr)
library(cowplot)

folder_location = here()

ncores= 16

1 Introduction

With the rapid increase in throughput of next-generation sequencing technology, an individual run of a typical sequencing machine (e.g. those produced by Illumina) produces many more reads than is necessary for interpreting libraries generated by most functional genomic assays. Consequently, to make efficient use of these machines, DNA libraries are pooled together prior to sequencing, a process known as “multiplexing”. The ligation of unique barcodes onto DNA molecules within each library before pooling incorporates a known sequence into each read, allowing the separation of reads from different libraries after sequencing is completed. Multiplexing also protects against technical artefacts between samples (e.g. batch effects) and sample loss due to sequencing lane failures. As such, multiplexing is widely considerd to be good practice for many experiments, and is essential for cost effective analysis of small libraries such as those in single-cell RNA-sequencing experiments.

The most recent DNA-sequencing machines released by Illumina (HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to increase throughput and cost efficiency. On these new flow cells, the process of “seeding” DNA molecules into the patterned wells and amplification of the seeded DNA occur simultaneously. These machines have been in use for several years in a diverse range of genomic fields.

However, it has been recently reported that these machines can lead to the mislabelling of DNA molecules with the incorrect library barcode. A schematic showing the swapping method proposed by Sinha et al. (2017) is shown in Figure 1. The mislabelling is likely driven by the extension of free barcode molecules using other DNA molecules as a template. The phenomenon has been acknowledged by Illumina (Illumina 2017), although estimates of swapping rates vary between reports. It is unclear whether a permanent solution to the problem will be forthcoming, as rapid amplification after seeding is critical to the operation of the patterned flow cell machines (Sinha et al. 2017).

knitr::include_graphics(paste0(folder_location, "/figs/schematic.png"))
Swapping schematic. The method by which swapping is proposed to take place, from Sinha et al., is shown.

Figure 1: Swapping schematic
The method by which swapping is proposed to take place, from Sinha et al., is shown.

This “swapping” (also called “hopping” or “switching”) of barcode labels can easily confound analyses: reads labelled with a barcode specific to a given sample may have originated from any other sample multiplexed with it, rather than coming from the sample itself. This phenomenon is particularly relevant for rapidly developing single-cell techniques, where a large number of samples are necessarily multiplexed together.

This file steps through the analysis upon which the manuscript Detection and Removal of Barcode Swapping in Single-Cell RNA-seq data is based, with all code used present and accessible by clicking on the Code buttons.

Before continuing, we define a few terms:

2 Plate-based analysis: Richard data

2.1 Introduction to data

libs = read.table(paste0(folder_location, "/data/richard_4000.tab"), header = T, stringsAsFactors = F)
libs_2500 = read.table(paste0(folder_location, "/data/richard_2500.tab"), header = TRUE, stringsAsFactors = FALSE)
# There's an empty well on one of the plates too - we should mark this as "unexpected"
libs$expected[libs$empty_well] = FALSE
libs_2500$expected[libs_2500$empty_well] = FALSE

In the first dataset analysed, we consider two 96-well plates of single-cell RNA-seq libraries (mouse blood cells). We used dual indices for cell labelling (a different barcode was used at each end of the molecule). The barcodes used for each plate are from mutually exclusive sets - any barcode from one plate was never used on the other (Figure 2). For sequencing, all cell libraries from the two plates were multiplexed.

ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
  geom_tile(aes(fill = expected), col = "grey50") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Barcode 1", y="Barcode 2") +
  scale_fill_manual(values = c("FALSE" = "grey30", "TRUE" = "coral"), name = "", labels = c("No cell", "Cell present"))
Richard data experimental design. Cells were seeded in each of the orange positions in the plot. Neither set of barcodes used for a given 96-well plate were used in the other. One barcode combintion (N729,S522) was left without a cell. Each orange block is a single 96-well plate. Grey positions are locations where we do not expect to see sequencing reads.

Figure 2: Richard data experimental design
Cells were seeded in each of the orange positions in the plot. Neither set of barcodes used for a given 96-well plate were used in the other. One barcode combintion (N729,S522) was left without a cell. Each orange block is a single 96-well plate. Grey positions are locations where we do not expect to see sequencing reads.

Cells were prepared for single-cell RNA-seq by the SmartSeq2 protocol(Picelli et al. 2014) with the following modifications. Cells were sorted into 96-well plates holding 4 \(\mu\)L of lysis buffer composed of 2.3 U SUPERase In RNase inhibitor (Thermo Fisher Scientific), 0.11 % (v/v) Triton X-100 (Sigma), 12.5 mM DTT (Thermo Fisher Scientific), and 2.5 mM dNTP mix (Thermo Fisher Scientific). 1 \(\mu\)L annealing mix containing diluted ERCC RNA Spike-In Mix (Thermo Fisher Scientific) and 10 M oligo-dT30VN (Sigma) was added to each well before to reverse transcription with SuperScript II (Invitrogen). cDNA amplification was performed with 23 PCR cycles and the resulting PCR products purified with Ampure XP Beads (Agencourt) at a volume ratio of 0.7:1 beads:DNA. Libraries were prepared using Nextera XT DNA Sample Preparation Kit and indexes from the Nextera XT Index Kit v2 Set A and Nextera XT Index Kit v2 Set D (Illumina). Each plate of libraries was pooled, cleaned with Ampure XP beads, and quantitated using KAPA Library Quantification Kit (Roche). Equimolar quantities of libraries from each plate were combined for sequencing.

Read mapping was performed using Subread (v1.5.1; considering uniquely mapped reads and with Phred offset of 33, with otherwise default parameters) using the Ensembl mm10 genome with ERCC92 annotation.

Reads were counted using the featureCounts function in the Rsubread package Rsubread (default options, except minMQS=10) using the Ensembl mm10 GTF file, and Thermo-Fisher ERCC92 GTF (see the Marioni Lab Github).

Reads were demultiplexed allowing for any of the barcode combinations shown in Figure 2 (including the impossible combinations).

In the absence of barcode swapping, it should be impossible to observe mapped reads with barcodes from each of the two different plates, i.e., there should be no mapped reads in the grey areas of Figure 2. We will refer to these barcode combinations as “impossible” combinations, in comparison to the expected combinations in orange. Note that we consider mapped reads as these are less likely to derive from various sequencing artefacts or contamination from e.g. human or bacterial sources (especially in the empty well).

We plot in Figure 3 the observed numbers of mapped reads for each barcode combination. Where cells have been loaded, there are many mapped reads (yellow/green areas). In the impossible barcode combinations we observe a lower but non-zero number of mapped reads.

ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
  geom_tile(aes(fill = log10(mapped))) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_viridis(name = "log10\nmapped\ncounts") +
  labs(x="Barcode 1", y="Barcode 2")
Number of mapped read per barcode combination. The number of mapped reads is shown for each barcode combination, as a log10 count.

Figure 3: Number of mapped read per barcode combination
The number of mapped reads is shown for each barcode combination, as a log10 count.

medfrac = median(libs$reads[!libs$expected])/median(libs$reads[libs$expected])
mappedfrac = median(libs$mapped[!libs$expected])/median(libs$mapped[libs$expected])
totfrac = sum(libs$mapped[!libs$expected])/sum(libs$mapped[libs$expected])

The distribution of these library sizes are shown in Figure 4 below. The impossible combinations have a median library size that is 1.5% of the median size of the expected combinations. The corresponding value for all reads (i.e. mapped and unmapped) is similar, at 1.8%. Considering all mapped reads on the plate, there are 1.1% in unexpected combinations relative to the expected combinations. Note that we consider the empty well to be an ``impossible’’ barcode combination though it does contain a low level of mapped ERCC spike-in reads.

pdf = melt(libs)

ggplot(data = pdf, mapping = aes(x = expected, y = value, fill = variable)) +
  geom_boxplot() +
  scale_y_log10(breaks = 10^(seq(3,7)), labels = c("1,000", "10,000", "100,000", "1,000,000", "10,000,000")) +
  labs(x = "Real cell", y = "Library size") +
  theme_bw() +
  scale_x_discrete(labels = c("\"Impossible\" barcode\ncombination", "Expected barcode\ncombination"), name = "") +
  scale_fill_manual(name = "Read type", values = c("royalblue2","springgreen3"), labels = c("All reads", "Mappable"))
Mapped library sizes of seeded and unseeded wells. The library sizes of the unexpected barcode combinations (i.e. wells without cells) are lower than those of the expected combinations (wells with cells), but considerably larger than 0.

Figure 4: Mapped library sizes of seeded and unseeded wells
The library sizes of the unexpected barcode combinations (i.e. wells without cells) are lower than those of the expected combinations (wells with cells), but considerably larger than 0.

2.2 Swapping rate estimation

If we make the assumption that barcode swapping is rare, then it is unlikely that one molecule will undergo more than one round of swapping. We also assume that the number of observed swapped transcripts is proportional to the number of transcripts that are available for swapping. If this is true, then the size of each of the impossible combinations will be proportional to the sum of the library sizes of the expected combinations that share a single barcode. These are the reads for which a single barcode swap could move them to the impossible barcode combination.

For example, the combination N719 and S505 (red position) would take swapping contributions from the cells in the blue positions in Figure 5.

is_target = libs$bc1 == "N719" & libs$bc2 == "S505"
avail = (libs$bc1 == "N719" | libs$bc2 == "S505") & libs$expected

schem_cols = rep("black", nrow(libs))
schem_cols[libs$expected] = "grey40"
schem_cols[is_target] = "indianred"
schem_cols[avail] = "cornflowerblue"
col_index = unique(schem_cols); names(col_index) = col_index

ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
  geom_tile(aes(fill = as.character(cols))) +
  scale_fill_manual(values = col_index) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Barcode 1", y = "Barcode 2")
Considered 'available' transcript for barcode combination (N719,S505). We assume that the swapped transcript for the red labelled barcode combination will derive from the blue labelled combinations on the cell-loaded plate. These combinations need to swap only a single barcode to arrive at the red labelled cell. Cell-loaded combinations are shown in grey, and unloaded combinations in black.

Figure 5: Considered ‘available’ transcript for barcode combination (N719,S505)
We assume that the swapped transcript for the red labelled barcode combination will derive from the blue labelled combinations on the cell-loaded plate. These combinations need to swap only a single barcode to arrive at the red labelled cell. Cell-loaded combinations are shown in grey, and unloaded combinations in black.

We now more formally define the model:

Let \(X_{i,j}\) denote the mapped library size for the library indexed by barcodes \(i\) (\(i \in {1,\dots,16}\), along rows; \(i=1\Rightarrow \text{S522}, i=16\Rightarrow \text{S502}\)) and \(j\) (\(j \in {1,\dots,24}\), along columns; \(j=1\Rightarrow \text{N701}, j=24\Rightarrow \text{N729}\)). The experimental design results in cells being seeded in two blocks of 96 wells (see Figure 2), corresponding to \(1 \leq i \leq 8\), \(13 \leq j \leq 24\) and \(9 \leq i \leq 16\), \(1 \leq j \leq 12\). Remaining barcode combinations (in the off-diagonal blocks) were not seeded with any cells, or indeed reagents. Additionally the well with index \(i=1,j=24\) was empty by design.

Since index swapping is assumed to occur only for cells sharing one barcode, we assume that the library size of the unoccupied barcode combinations is proportional to the library size of cell-seeded barcode combinations that share a single index.

We therefore fit the model: \[X_{i,j} = \alpha + \beta S_{i,j}\] for

\[1 \leq i \leq 8, 1\leq j \leq 12, \text{and } 9\leq i \leq 16, 13\leq j \leq 24\]

where

\[S_{i,j} = \begin{cases} \sum_{k=13}^{24} X_{i,k} + \sum_{l=9}^{16} X_{l,j} \mbox{ for } 1 \leq i \leq 8, 1 \leq j \leq 12 \\ \sum_{k=1}^{12} X_{i,k} + \sum_{l=1}^{8} X_{l,j} \mbox{ for } 9 \leq i \leq 16, 13 \leq j \leq 24 \end{cases}\]

using ordinary least squares. Data and the model fit are shown in Figure 6.

#melting the data frame purely to make casting it easier
molten = melt( libs[libs$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
#casted to make easier to rowSums/colSums to get the contributions per barcode
casted = acast( molten , bc1~bc2 )
nsums = rowSums(casted, na.rm = T)
ssums = colSums(casted, na.rm = T)

unexp = libs[!libs$expected, ]
#now make the data frame from unexpected combinations for model fit
cell_outs = lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
                                             col = ssums[unexp$bc2[x]],
                                             row = nsums[unexp$bc1[x]]))

df = do.call(rbind, cell_outs)
df$tot = df$col + df$row

mod_mapped = lm(df$mapped ~ df$tot)


ggplot(df, aes(x = tot, y = mapped)) +
  geom_point(col = "grey50") +
  geom_smooth(method = "lm", se = FALSE, col = "black", formula = y ~ x) +
  # scale_x_log10() + scale_y_log10() +
  labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
  theme_bw() +
    scale_x_continuous(breaks = c(5e6, 1e7, 1.5e7), labels = c("5,000,000", "10,000,000", "15,000,000")) +
  scale_y_continuous(breaks = seq(2500, 10000, 2500), labels = c("2,500", "5,000", "7,500", "10,000"))+
  theme(axis.text = element_text(face = "bold", size = 11), 
       axis.title.y = element_text(size = 13, face = "bold"), 
       axis.title.x = element_text(size = 13, face = "bold")) +
  annotate("text", x = 5500000, y = 8000, label = paste0("italic(R)^2 == ", format(summary(mod_mapped)$r.squared, digits = 2)), parse = TRUE, size = 9)
Swapping model fit. The one-barcode-distant available transcript and observed swapped transcript counts are shown, with the model fit described above overlaid.

Figure 6: Swapping model fit
The one-barcode-distant available transcript and observed swapped transcript counts are shown, with the model fit described above overlaid.

#retain for later use
df_4000 = df

Consider the units of \(\beta\), the slope of the regression above. We have measured, in our regression, the number of transcripts arriving in each of the recipient libraries from 20 donor libraries. However, consider also that each of the donor libraries is contributing to 20 recipient libraries in all, as shown in Figure 5. The unit of \(\beta\) is the swapping rate from one donor library to one recipient library, as both the predictor and outcome variable are effectively multiplied by the same constant of 20:

\[[\beta] = \frac{\text{swapped reads}/recipient}{\text{original reads}/donor} \times \frac{20~donors}{20~donors}\]

So, \(\beta\) indicates the swapping rate from one donor library to one recipient library. However, the total swap rate should be the fraction of donor reads that are swapped from one library into all available libraries, i.e. the total fraction of swapped reads.

There are 38 available destinations for a single-barcode-swapped read from one of the donor libraries, as shown in Figure 7.

is_target = libs$bc1 == "N705" & libs$bc2 == "S505"
avail = (libs$bc1 == "N705" | libs$bc2 == "S505") #& libs$expected

schem_cols = rep("black", nrow(libs))
schem_cols[libs$expected] = "grey40"
schem_cols[avail] = "cornflowerblue"
schem_cols[is_target] = "indianred"
col_index = unique(schem_cols); names(col_index) = col_index

ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
  geom_tile(aes(fill = as.character(cols))) +
  scale_fill_manual(values = col_index) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Barcode 1", y = "Barcode 2")
Pattern of swapping. Our model was fitted to the empty barcode combinations, which received libraries from the real cells. Each of those donor cells (one is shown here in red) swaps not into one particular library, but into all 38 recipient libraries that share a single barcode (here shown in blue). Cell-loaded combinations are shown in grey, and unloaded combinations in black.

Figure 7: Pattern of swapping
Our model was fitted to the empty barcode combinations, which received libraries from the real cells. Each of those donor cells (one is shown here in red) swaps not into one particular library, but into all 38 recipient libraries that share a single barcode (here shown in blue). Cell-loaded combinations are shown in grey, and unloaded combinations in black.

Therefore the swapping rate in this experiment \(R\) is \(38\beta\):

\[R = 38\beta\]

\[[R] = \frac{38~\text{recipients}}{\text{donor}} \times\frac{\text{swapped reads}/recipient}{\text{original reads}/donor} = \frac{\text{swapped reads}}{\text{original reads}}\]

This provides an estimated swapping rate of 2.187$$0.0765%. Notably, this is higher than the median-to-median fraction of 1.5%. This is because the median-to-median fraction only considered swapped reads arriving in the unexpected barcode combinations, whereas this slope-estimated value considers reads swapping across the entire set of barcode combinations.

The intercept value for the model is -398.2$$168.8, which is close to zero as expected.

2.3 Negative control: HiSeq 2500

The exact same pool of libraries was also sequenced on a HiSeq 2500. We see strong correlation between library sizes from the two machines, as shown in Figure 8.

libs = libs[order(libs$bc1, libs$bc2),]
libs_2500 = libs_2500[order(libs_2500$bc1, libs_2500$bc2),]

if(!all( (libs$bc1 == libs_2500$bc1) & (libs$bc2 == libs_2500$bc2) )){
  print("barcodes don't match")
}

ggplot(data = data.frame(hiseq_4000 = libs$mapped, hiseq_2500 = libs_2500$mapped, expected = libs$expected), 
       mapping = aes(x=hiseq_4000, y = hiseq_2500, col = factor(expected, levels = c(TRUE, FALSE), labels = c("Real cell", "\"Impossible\" barcode\ncombination")))) +
  geom_point(alpha = 0.7) +
  scale_x_log10( breaks = 10^(4:6), labels = c("10,000", "100,000", "1,000,000"), name = "HiSeq 4000 Library size") +
  scale_y_log10( breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000"), name = "HiSeq 2500 library size") +
  scale_color_brewer(palette = "Set1", name = "") +
  theme_bw()
Library sizes between sequencing machines. We observe excellent correlation between library sizes of the two sequencing machines, particularly for the real cell libraries.

Figure 8: Library sizes between sequencing machines
We observe excellent correlation between library sizes of the two sequencing machines, particularly for the real cell libraries.

mappedfrac_2500 = median(libs_2500$mapped[!libs_2500$expected])/median(libs_2500$mapped[libs_2500$expected])
totfrac_2500 = sum(libs_2500$mapped[!libs_2500$expected])/sum(libs_2500$mapped[libs_2500$expected])

In the HiSeq 2500 data, many fewer reads are present in the unexpected barcode combinations. The unexpected combinations have a median library size of 0.11% the median size of the expected combinations (vs. 1.5% from HiSeq 4000). Considering all mapped reads on the plate, there are 0.084% in the unexpected combinations as in the expected ones (vs. 1.1% from HiSeq 4000).

We have applied the same model to the HiSeq 2500 data as described above for the HiSeq 4000 data. The fit and data is displayed graphically in Figure 9

molten = melt( libs_2500[libs_2500$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
casted = acast( molten , bc1~bc2 )
nsums = rowSums(casted, na.rm = T)
ssums = colSums(casted, na.rm = T)

unexp = libs_2500[!libs_2500$expected, ]

cell_outs = lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
                                             col = ssums[unexp$bc2[x]],
                                             row = nsums[unexp$bc1[x]]))

df = do.call(rbind, cell_outs)
df$tot = df$col + df$row

df_outlier = df[-which.max(df$mapped),]
mod_mapped_outlier = lm(df_outlier$mapped ~ df_outlier$tot)
mod_mapped_2500 = lm(df$mapped ~ df$tot)

ggplot(df, aes(x = tot, y = mapped)) +
    geom_abline(slope = coef(mod_mapped_outlier)[2], intercept = coef(mod_mapped_outlier)[1], lwd = 1.5 ) +
  geom_point(x = df$tot[which.max(df$mapped)], y = max(df$mapped), col = "coral", size = 3, alpha = 1) +
  geom_point(col = "grey50") +
  # geom_smooth(method = "lm", se = FALSE, col = "black", formula = y ~ x) +
  # scale_x_log10() + scale_y_log10() +
  labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
  theme_bw() +
    scale_x_continuous(breaks = c(5e6, 1e7), labels = c("5,000,000", "10,000,000")) +
  scale_y_continuous(breaks = c(0, 600, 1200), labels = c("0", "600", "1,200"), limits = c(0,1250))+
  theme(axis.text = element_text(face = "bold", size = 11), 
       axis.title.y = element_text(size = 13, face = "bold"), 
       axis.title.x = element_text(size = 13, face = "bold")) +
  annotate(geom = "text", x = df$tot[which.max(df$mapped)], y = max(df$mapped)-75, label = "Outlier removed from model fit")
Model fit for HiSeq 2500 data. A similar trend as for the HiSeq 4000 data can be observed. We have excluded one outlying data point from the fit to prevent disproportionate influence on the prediction of the slope.

Figure 9: Model fit for HiSeq 2500 data
A similar trend as for the HiSeq 4000 data can be observed. We have excluded one outlying data point from the fit to prevent disproportionate influence on the prediction of the slope.

Interestingly, we do still observe the swapping pattern. However, the swapping rate estimated is around an order of magnitude lower on the HiSeq 2500: 0.224$\(0.00955% on the HiSeq 2500 vs. 2.187\)$0.0765% on the HiSeq 4000.

3 Plate-based analysis: Nestorowa data

3.1 Introduction to data

#Load the 4000/2500 blood data
noswap = as.matrix(read.table(paste0(folder_location, "/data/nest_2500.tab"), header = T, row.names = 1))
swap = as.matrix(read.table(paste0(folder_location, "/data/nest_4000.tab"), header = T, row.names = 1))

plate_names = unique(substr(colnames(swap), start = 1, stop = 8))

#some of the cells had to be re-pooled, so not sequencing the exact same libraries. Remove these.
newpool_lanes = which(grepl("9527", colnames(swap)) | grepl("9548", colnames(swap)) | grepl("9549", colnames(swap)) | grepl("9550", colnames(swap)))

noswap_all = noswap[, -newpool_lanes]
swap_all = swap[, -newpool_lanes]

#drop spikes, should be same level in all wells --> no net swapping
noswap = noswap[grepl("ENSMUSG", rownames(noswap)), -newpool_lanes]
swap = swap[grepl("ENSMUSG", rownames(swap)), -newpool_lanes]
#split into plates
blood_meta = data.frame(cell = colnames(swap))
split_1 = strsplit(as.character(blood_meta$cell), split = ".", fixed = T)
blood_meta$plate = sapply(split_1, function(x) x[2])
split_2 = strsplit(sapply(split_1, function(x) x[3]), split = "_")
blood_meta$column = sapply(split_2, function(x) x[1])
blood_meta$row = sapply(split_2, function(x) x[2])

swap_split = lapply(split(as.data.frame(t(swap)), blood_meta$plate), function(x) t(as.matrix(x)))
noswap_split = lapply(split(as.data.frame(t(noswap)), blood_meta$plate), function(x) t(as.matrix(x)))

To confirm the existance of barcode swapping in multiple datasets, we have used data from a published study (Nestorowa et al. 2016), which we refer to as the Nestorowa data. Each of 16 plates of single-cell libraries was pooled and sequenced in a single lane on a HiSeq 2500, and at a later date the same pooled libraries were sequenced on a HiSeq 4000. Because exactly the same pools of libraries were used, the only differences between the results should be caused by the sequencing machines and Poisson sampling noise (Marioni et al. 2008). It is important that the libraries were not repooled, as this would likely induce systematic differences in the proportions that make up the pool, which may confound results.

3.2 Crosshair swapping patterns

First, we identified the gene with the most read counts in a single cell in the first plate of cells - Igkc. This gene is almost uniquely expressed in one cell in the plate. On both machines, we observe the same pattern of “crosshair” swapping (along the row and column of a highly expressing cell) as reported in Sinha et al. (2017). This is shown in Figures 10 and 11.

#small function to get the fraction of counts on the whole plate
#in the crosshair of the maximally expressing cell
get_fraction = function(gene, meta, counts){
  expr = counts[gene,]
  cell = names(expr)[which.max(expr)]
  
  bc1 = meta$column[meta$cell == cell]
  bc2 = meta$row[meta$cell == cell]
  
  cells = as.character(meta$cell[meta$column == bc1 | meta$row == bc2])
  cells = cells[-which(cells == cell)]
  
  
  return(sum(expr[cells])/expr[cell])
  
}

#generates a crosshair-like plot
plate_plot = function(gene, meta, counts, log_transform = T, plot = F){
  
  if(nrow(meta)!=ncol(counts)){
    stop("meta/counts not same size")
  }
  
  if(log_transform){
    expr = log2(counts[gene,] + 1 )
  } else {
    expr = counts[gene, ]
  }
  mat = matrix(NA, 
               ncol = length(unique(meta$column)), 
               nrow = length(unique(meta$row)), 
               dimnames = list(
                 unique(meta$row)[order(unique(meta$row))],
                 unique(meta$column)[order(unique(meta$column))]
               ))
  for(i in 1:length(expr)){
    mat[meta$row[i], meta$column[i]] = expr[i]
  }
    
    # pheatmap(mat, cluster_rows = F, cluster_cols = F, display_numbers = T,
    #         color = colorRampPalette(brewer.pal(n = 7, name ="Spectral"))(100),
    #         breaks = seq(from = 0, to = max(mat, na.rm = T) + 1e-8, length.out = 101))
    
    melt = melt(mat)
    plot_out = ggplot(melt, aes(x = Var2, y = Var1)) +
      geom_tile(aes(fill = value), color = "white") +
      scale_fill_gradientn(colors = c("grey70", "cornflowerblue", "black"), name = "log2(count+1)") +
      # scale_fill_viridis(name = "log2(count+1)") +
      annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
      labs(x = "", y = "")
    
  if(plot){
    print(plot_out)
  }
  return(list(matrix = mat, plot = plot_out))
}
plate_2500 = plate_plot(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]], 
                        meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
                        counts = noswap_split[[1]])

print(plate_2500[[2]] + ggtitle("HiSeq 2500"))
Crosshair pattern on HiSeq 2500. The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode. The experimental design should in the experiment does not produce this behaviour.

Figure 10: Crosshair pattern on HiSeq 2500
The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode. The experimental design should in the experiment does not produce this behaviour.

frac_2500 = get_fraction(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
                         meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
                         counts = noswap_split[[1]])
plate_4000 = plate_plot(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]], 
                        meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
                        counts = swap_split[[1]])

print(plate_4000[[2]] + ggtitle("HiSeq 4000"))
Crosshair pattern on HiSeq 4000. The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode. The experimental design used in the experiment should not produce this behaviour.

Figure 11: Crosshair pattern on HiSeq 4000
The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode. The experimental design used in the experiment should not produce this behaviour.

frac_4000 = get_fraction(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]], 
                         meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
                         counts = swap_split[[1]])

It is clear that the pattern is stronger on the HiSeq 4000. There are 1.93% as many Igkc reads in the crosshair as in the central highly expressing cell in the 4000 data, compared to 0.257% for the HiSeq 2500. This is consistent with the approximately order of magnitude difference identified in the swapping rate between the two technologies from the Richard data.

We can show the divergence between the machines more clearly by considering the difference between the log2 counts, shown in Figure 12.

delta = plate_4000[[1]] - plate_2500[[1]]
melt = melt(delta)
plot_out = ggplot(melt, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value), color = "white") +
  # scale_fill_gradientn(colors = c("coral" ,"grey70", "cornflowerblue", "black"), name = "delta log2(count+1)", values = rescale(c(min(melt$value), 0, max(melt$value)/2, max(melt$value)))) +
  scale_fill_gradientn(colors = c("royalblue3" ,"grey80", "indianred3"), name = "delta log2(count+1)", values = rescale(c(min(melt$value), 0, max(melt$value)))) +
  annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
  labs(x = "", y = "") +
  ggtitle("HiSeq 4000 - HiSeq 2500")

plot_out
Sequencing machine crosshair differences. The difference between the log2 counts of the gene Igkc is shown between the two sequencing machines - more reads are concentrated in the crosshair on the HiSeq 4000 than the HiSeq 2500, which is suggestive of swapping.

Figure 12: Sequencing machine crosshair differences
The difference between the log2 counts of the gene Igkc is shown between the two sequencing machines - more reads are concentrated in the crosshair on the HiSeq 4000 than the HiSeq 2500, which is suggestive of swapping.

These crosshair patterns show that swapping along rows and columns is the primary mode by which swapping occurs. This is likely because swapping is a rare event, so for swapping to affect both ends of a transcript is rare (\(0.5p_{swap}^2\), as swaps must occur at opposite ends of the molecule). This justifies our focus on row-column swapping in these sections.

However, it is possible that there are some more subtle effects - for example, an excess of barcode for a particular row or column of a plate may result in a greater amount of swapping to barcode combinations that contain this particular barcode. We cannot readily consider these, as library quantification captures quantities of DNA lengths, not of their sequence content, so different barcodes are indistinguishable.

3.3 Quantifying swapping rate

We applied a model that identifies contributions of different cells in the HiSeq 2500 data (treating this as a baseline, low-swapping system) to the swapping-affected transcriptomes of the HiSeq 4000 data. This will allow quantification of an overall swapping rate across multiple genes.

Let \(C^{4000}_{g \times c}\) denote the count matrix for the HiSeq 4000 libraries, where position (\(g_i\), \(c_j\)) corresponds to the expression level of gene \(i\) in cell \(j\). Let \(C^{2500}_{g \times c}\) denote the count matrix for the HiSeq 2500 libraries. Let \(M_{c \times c}\) denote a matrix capturing the contribution of the HiSeq 2500 counts to the HiSeq 4000 counts. Each entry of \(M_{c \times c}\) defines the proportion of each HiSeq 2500 library that makes up each HiSeq 4000 library.

More specifically, if \(c\) indexes the columns of \(M\) (\(1 \leq c \leq cell\)) and \(r\) indexes the rows (\(1 \leq r \leq cell\)), then \((r, c)\) of \(M_{c\times c}\) represents the contribution of cell \(r\)’s expression profile in the HiSeq 2500 data to the expression profile of cell \(c\) in the HiSeq 4000 data. For a cell with a given pair of barcodes, we want to discriminate between the contribution of other cells with exactly one shared barcode and other cells with no shared barcodes to the expression profile of the cell of interest. To do this, we define the elements of the matrix \(M_{r,c}\) as:

\[ M_{r,c} = \begin{cases} \alpha & \text{if cell $c$ and cell $r$ share both barcodes},\\ \beta & \text{if cell $c$ and cell $r$ share exactly one barcode},\\ \gamma & \text{if cell $c$ and cell $r$ share neither barcodes}. \end{cases} \]

Subsequently, let:

\[ C^{4000}_{g \times c} = C^{2500}_{g \times c} M_{c \times c} + \epsilon\]

where \(\epsilon\) represents the residual error. The aim is to obtain estimates of \(\alpha\), \(\beta\), and \(\gamma\) for each plate, using information across many genes to stabilise the estimates.

Gene selection is important for the fitting of this model. We look at the expression of two genes on the first HiSeq 2500 plate in the dataset in Figures 13 and 14

rowMax = function(mat){
  return(apply(mat, 1, max))
}

#pull out some genes for demo of selection
gene_vars = apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1,  function(x) max(x)/quantile(x, 0.9))
gene_vars = gene_vars[order(gene_vars, decreasing = TRUE)]

plot1 = plate_plot(gene = names(gene_vars[30]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),], counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)

print(plot1[[2]] +ggtitle("Gene A"))
Gene expression pattern of gene A. The expression of gene A is shown in log-counts for cells on the first plate of this dataset.

Figure 13: Gene expression pattern of gene A
The expression of gene A is shown in log-counts for cells on the first plate of this dataset.

plot2 = plate_plot(gene = names(gene_vars[1000]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),], counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)

print(plot2[[2]] + ggtitle("Gene B"))
Gene expression pattern of gene B. The expression of gene B is shown in log-counts for cells on the first plate of this dataset. Gene B is expressed widely across the plate.

Figure 14: Gene expression pattern of gene B
The expression of gene B is shown in log-counts for cells on the first plate of this dataset. Gene B is expressed widely across the plate.

Both genes are expressed in only a subset of the cells on the plate, but gene B is expressed much more broadly than gene A. This poses a problem for the deconvolution model: for gene B, it is harder for the model to distinguish between contributions from other cells on the row and column of a certain cell (\(\beta\)), and from all the other cells on the plate (\(\gamma\)), because many cells in both of these sets express the gene at high levels. By contrast, gene A is expressed at a high level in only very few cells; it is therefore easier for the model to distinguish between large values of \(\beta\) and \(\gamma\). Genes that are expressed broadly in the manner of gene B are therefore less informative than genes like gene A for the model fit.

In order to capture the most informative genes, we first subset by expression levels: genes are retained if they are present at a minimum level of 500 counts in at least one cell in the HiSeq 2500 data. It is important to have a large number of transcripts present, otherwise swapping may be too rare to detect.

Then, to find informative genes, we rank them according to their maximum expression on the plate divided by the 90th percentile expression. This value will be highest when only a very few cells are highly expressing a gene.

These scores are plotted in decreasing order for a few plates chosen at random in Figure 15:

set.seed(42)
plates = sample(length(noswap_split), 4)
gene_vars = lapply(plates, function(i) 
  apply(noswap_split[[i]][rowMax(noswap_split[[i]])>500,], 1,  function(x) max(x)/quantile(x, 0.9)))
gene_vars = lapply(gene_vars, function(x) x[order(x, decreasing = TRUE)])
gene_vars = lapply(gene_vars, function(x) x[x!=Inf])

plate_ids = lapply(1:length(gene_vars), function(x) rep(plates[x], length(gene_vars[[x]])))
indices = lapply(gene_vars, function(x) 1:length(x))

plot_df = data.frame(score = unlist(gene_vars), plate = paste("plate", unlist(plate_ids)), index = unlist(indices))

ggplot(plot_df, aes(y = score, x = index)) +
  geom_point(size = 0.5, col = "darkgrey") +
  facet_wrap(~plate, nrow = 2) +
  scale_y_log10(breaks = c(10, 100, 1000), labels = c("10", "100", "1,000"), name = "max / 90% quantile") +
  theme_bw() +
  lims(x = c(1, 1000))
Informative genes for four plates. For four plates, chosen at random from the dataset, genes are ranked in descending order by ratio of their maximum expression on the plate and the 90% quantile expression. Genes are plotted in descending order, with infinite values of the ratio excluded from plots, but used in the models (there are at most 33 such ratios in one plate). Only the top 1000 genes are shown.

Figure 15: Informative genes for four plates
For four plates, chosen at random from the dataset, genes are ranked in descending order by ratio of their maximum expression on the plate and the 90% quantile expression. Genes are plotted in descending order, with infinite values of the ratio excluded from plots, but used in the models (there are at most 33 such ratios in one plate). Only the top 1000 genes are shown.

In these plates, the top 250 genes contain most of the informative genes. We consider 500 genes to ensure that we include the infinite ratios (i.e. where the 90th percentile is 0, the ratio is not defined) and do not exclude informative genes for other plates than those shown above, which may have longer tails to the distribution shown. We have also run fits considering both 250 genes and 1000 genes, and obtain similar parameter estimates (shown below, in Figure 19).

Notably, the maximum expression value on a plate does not substantially drive selection of genes, shown for two plates in Figure 16.

vars_1 = apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1,  function(x) max(x)/quantile(x, 0.9))
maxs_1 = rowMax(noswap_split[[1]][rowMax(noswap_split[[1]])>500,])
keep_1 = vars_1 >= vars_1[order(vars_1, decreasing = TRUE)][500]

vars_2 = apply(noswap_split[[2]][rowMax(noswap_split[[2]])>500,], 1,  function(x) max(x)/quantile(x, 0.9))
maxs_2 = rowMax(noswap_split[[2]][rowMax(noswap_split[[2]])>500,])
keep_2 = vars_2 >= vars_2[order(vars_2, decreasing = TRUE)][500]

plot_df = data.frame(score = c(vars_1, vars_2), max = c(maxs_1, maxs_2), plate = c(rep("Plate 1", length(vars_1)), rep("Plate 2", length(vars_2))), keep = c(keep_1, keep_2))

plot_df = plot_df[plot_df$score!= Inf,]

ggplot(plot_df, aes(x = max, y = score, col = factor(keep))) +
  geom_point() +
  facet_grid(. ~ plate) +
  scale_x_log10(name = "Maximum gene expression count", breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000")) + scale_y_log10("Max/90% score") +
  scale_color_manual(labels = c("Dropped gene", "Analysed gene"), name = "", values = c("darkgrey", "coral"))
Effect of expression magnitude on gene selection. The ratio we use to select genes is not driven strongly by the size of the numerator: we are not simply selecting for genes with the highest expression in one cell.

Figure 16: Effect of expression magnitude on gene selection
The ratio we use to select genes is not driven strongly by the size of the numerator: we are not simply selecting for genes with the highest expression in one cell.

The model fit used Poisson precision weights: specifically, counts (both response \(C^{4000}\) and predictors \(C^{2500}\)) were multiplied by the reciprocal of the square-root of their gene’s mean expression, to prevent the most highly expressed genes dominating the least-squares fit. A constrained linear inverse model was used for the fit to avoid obtaining negative values of \(\alpha\), \(\beta\), and \(\gamma\) (using the package lsei). Gene subsetting and fitting of the model was undertaken separately for each plate of cells, so each plate uses its own informative gene set.

#This chunk is particularly slow - perhaps run on only a subset of plates if you want speed

#we needed to add a line to the source code of this function to avoid
#an overflow issue in fortran
#The line just divides all the values of a matrix by a constant value
#otherwise the function is unchanged from the package lsei
my_lsei = function (A = NULL, B = NULL, E = NULL, F = NULL, G = NULL, 
  H = NULL, Wx = NULL, Wa = NULL, type = 1, tol = sqrt(.Machine$double.eps), 
  tolrank = NULL, fulloutput = FALSE, verbose = TRUE) 
{
  require(quadprog)
  
  if (is.vector(E) & length(F) == 1) 
    E <- t(E)
  else if (!is.matrix(E) & !is.null(E)) 
    E <- as.matrix(E)
  if (is.vector(A) & length(B) == 1) 
    A <- t(A)
  else if (!is.matrix(A) & !is.null(A)) 
    A <- as.matrix(A)
  if (is.vector(G) & length(H) == 1) 
    G <- t(G)
  else if (!is.matrix(G) & !is.null(G)) 
    G <- as.matrix(G)
  if (!is.matrix(F) & !is.null(F)) 
    F <- as.matrix(F)
  if (!is.matrix(B) & !is.null(B)) 
    B <- as.matrix(B)
  if (!is.matrix(H) & !is.null(H)) 
    H <- as.matrix(H)
  if (is.null(A) && is.null(E)) {
    if (is.null(G)) 
      stop("cannot solve least squares problem - A, E AND G are NULL")
    A <- matrix(data = 0, nrow = 1, ncol = ncol(G))
    B <- 0
  }
  else if (is.null(A)) {
    A <- matrix(data = E[1, ], nrow = 1)
    B <- F[1]
  }
  Neq <- nrow(E)
  Napp <- nrow(A)
  Nx <- ncol(A)
  Nin <- nrow(G)
  if (is.null(Nx)) 
    Nx <- ncol(E)
  if (is.null(Nx)) 
    Nx <- ncol(G)
  if (is.null(Nin)) 
    Nin <- 1
  if (is.null(Neq)) {
    Neq <- 0
    if (verbose & type == 1) 
      warning("No equalities - setting type = 2")
    type = 2
    F <- NULL
  }
  else {
    if (ncol(E) != Nx) 
      stop("cannot solve least squares problem - A and E not compatible")
    if (length(F) != Neq) 
      stop("cannot solve least squares problem - E and F not compatible")
  }
  if (is.null(G)) 
    G <- matrix(data = 0, nrow = 1, ncol = Nx)
  if (is.null(H)) 
    H <- 0
  if (ncol(G) != Nx) 
    stop("cannot solve least squares problem - A and G not compatible")
  if (length(B) != Napp) 
    stop("cannot solve least squares problem - A and B not compatible")
  if (length(H) != Nin) 
    stop("cannot solve least squares problem - G and H not compatible")
  if (!is.null(Wa)) {
    if (length(Wa) != Napp) 
      stop("cannot solve least squares problem - Wa should have length = number of rows of A")
    A <- A * Wa
    B <- B * Wa
  }
  Tol <- tol
  if (is.null(Tol)) 
    Tol <- sqrt(.Machine$double.eps)
  IsError <- FALSE
  if (type == 1) {
    ineq <- Nin + Nx
    mIP <- ineq + 2 * Nx + 2
    lpr <- 1
    if (fulloutput) 
      lpr <- lpr + 3
    if (!is.null(tolrank)) 
      lpr <- lpr + 6
    if (!is.null(Wx)) {
      lw <- length(Wx)
      lpr <- lpr + 2 + lw
    }
    ProgOpt <- rep(1, lpr)
    if (lpr > 1) {
      ipr <- 1
      if (fulloutput) {
        ProgOpt[ipr:(ipr + 2)] <- c(ipr + 3, 1, 1)
        ipr <- ipr + 3
      }
      if (!is.null(tolrank)) {
        if (length(tolrank) == 1) 
          tolrank <- rep(tolrank, len = 2)
        ProgOpt[ipr:(ipr + 5)] <- c(ipr + 6, 4, tolrank[1], 
          ipr + 6, 5, tolrank[2])
        ipr <- ipr + 6
      }
      if (!is.null(Wx)) {
        lw <- length(Wx)
        if (lw == 1) {
          ProgOpt[ipr:(ipr + 2)] <- c(ipr + 3, 2, 1)
        }
        else {
          if (lw != Nx) 
            stop("cannot solve least squares problem - number of weighs should be =1 or =number of unknowns")
          lw <- lw + ipr + 1
          ProgOpt[ipr:lw] <- c(lw + 1, 3, Wx)
        }
      }
    }
    mdW <- Neq + Napp + ineq
    if (fulloutput) 
      mdW <- max(mdW, Nx)
    mWS <- 2 * (Neq + Nx) + max(Napp + ineq, Nx) + (ineq + 
      2) * (Nx + 7)
    storage.mode(A) <- storage.mode(B) <- "double"
    storage.mode(E) <- storage.mode(F) <- "double"
    storage.mode(G) <- storage.mode(H) <- "double"
    sol <- .Fortran("lsei", NUnknowns = Nx, NEquations = Neq, 
      NConstraints = Nin, NApproximate = Napp, A = A, 
      B = B, E = E, F = F, G = G, H = H, X = as.vector(rep(0, 
        Nx)), mIP = as.integer(mIP), mdW = as.integer(mdW), 
      mWS = as.integer(mWS), IP = as.integer(rep(0, mIP)), 
      W = as.double(matrix(data = 0, nrow = mdW, ncol = Nx + 
        1)), WS = as.double(rep(0, mWS)), lpr = as.integer(lpr), 
      ProgOpt = as.double(ProgOpt), verbose = as.logical(verbose), 
      IsError = as.logical(IsError))
    if (any(is.infinite(sol$nX))) 
      sol$IsError <- TRUE
    if (fulloutput) {
      covar <- matrix(data = sol$W, nrow = mdW, ncol = Nx + 
        1)[1:Nx, 1:Nx]
      RankEq <- sol$IP[1]
      RankApp <- sol$IP[2]
    }
  }
  else if (type == 2) {
    if (!is.null(Wx)) 
      stop("cannot solve least squares problem - weights not implemented for type 2")
    if (!is.null(Wa)) 
      stop("cannot solve least squares problem - weights not implemented for type 2")
    dvec <- crossprod(A, B)
    Dmat <- crossprod(A, A)
    diag(Dmat) <- diag(Dmat) + 1e-08
    
    #THIS IS THE ADDED CODE
    #As per answer 2 at http://stackoverflow.com/questions/28381855/r-function-solve-qp-error-constraints-are-inconsistent-no-solution?rq=1 it appears we have overflow in the fortran code, so we scale as recommended
    sc = base::norm(Dmat, "2")
    Dmat = Dmat/sc
    dvec = dvec/sc
    
    Amat <- t(rbind(E, G))
    bvec <- c(F, H)
    sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = Neq)
    sol$IsError <- FALSE
    sol$X <- sol$solution
  }
  else stop("cannot solve least squares problem - type unknown")
  X <- sol$X
  X[which(abs(X) < Tol)] <- 0
  if (any(is.infinite(X))) {
    residual <- Inf
    solution <- Inf
  }
  else {
    residual <- 0
    if (Nin > 0) {
      ineq <- G %*% X - H
      residual <- residual - sum(ineq[ineq < 0])
    }
    if (Neq > 0) 
      residual <- residual + sum(abs(E %*% X - F))
    if (residual > Tol) 
      sol$IsError <- TRUE
    solution <- 0
    if (Napp > 0) 
      solution <- sum((A %*% X - B)^2)
  }
  xnames <- colnames(A)
  if (is.null(xnames)) 
    xnames <- colnames(E)
  if (is.null(xnames)) 
    xnames <- colnames(G)
  names(X) <- xnames
  res <- list(X = X, residualNorm = residual, solutionNorm = solution, 
    IsError = sol$IsError, type = "lsei")
  if (fulloutput && type == 1) {
    res$covar <- covar
    res$RankEq <- sol$IP[1]
    res$RankApp <- sol$IP[2]
  }
  return(res)
}


#Function written largely by Aaron Lun
#Key function to estimating the contributions
# ARGS:
# plate_XXXX is the counts matrix for the cells to be compared
# bc1, bc2 are vectors for the barcodes for the cells in each of the plates
# constrain == TRUE uses a constrained fit, all coef > 0
# n.genes is the number of genes to select (see above for explanation)
# RETURN:
# vector of coefficients, "other" is cells that share no barcodes, "rowcol" is cells that share one barcode
#   "self" is the cell that shares 2 barcodes (i.e. itself.)
get_mixing_matrix = function(plate_4000, plate_2500, bc1, bc2, constrain = FALSE, n.genes = 500, sample = NULL){
  nsamples = ncol(plate_4000)
  

  #Gene selection
  if(nrow(plate_4000)>1){
    #straight SD version
    # top_var = apply(plate_2500, 1, sd)
    
    #As a fraction of the 0.9 quantile
    #logic being that these are the genes that show the "crosshair" patterns
    top_var = apply(plate_2500, 1, function(x) max(x)/quantile(x, 0.9))
    
    #Max expression vs. second highest cell
    # top_var = apply(plate_2500, 1, function(x) max(x)/max(x[-which.max(x)]))
  
    # impose an expression limit - the gene must be expressed in a cell with at least 500 counts
    # (otherwise some of the methods above pick up genes expressed with, say, 
    # one or two counts on the whole plate, where swapping cannot actually happen)
    top_var = top_var[apply(plate_2500, 1, max)>500]
    if(n.genes > length(top_var)){
      top_genes = names(top_var)
      warning(paste("Asked for more genes than qualified by abundance; instead using all", length(top_var), "abundant genes"))
    } else {
      top_genes = names(top_var)[order(top_var, decreasing = TRUE)][1:n.genes]
    }
    
    if(!is.null(sample)){
      if(sample > n.genes)
        stop("You asked to sample more genes than selected in total!")
      
      top_genes = sample(top_genes, sample)
    }
      
    
    #perform subsetting
    mixed <- plate_4000[top_genes,]
    pure <- plate_2500[top_genes,]
  } else {
    #if only one gene was supplied we don't subset
    mixed = plate_4000
    pure = plate_2500
  }


  #establish relationships between cells RE whether they share a barcode  
  beta = matrix(NA, ncol(mixed), ncol(mixed))
  shares_barcode = sapply(1:length(beta), function(x) grepl(bc1[row(beta)[x]], colnames(mixed)[col(beta)[x]]) | grepl(bc2[row(beta)[x]], colnames(mixed)[col(beta)[x]]))
  
  beta = matrix(as.numeric(shares_barcode), nrow = ncol(mixed), ncol = ncol(mixed), dimnames = list(colnames(mixed), colnames(mixed)))
  
  diag(beta) = 2
  beta = beta+1
  
  #now the parameters we are fitting are: 1 - non row/col; 2 - row/col; 3 - self

  # Formulation of C_{4000} = M C_{2500} is "backwards" for a linear model - we need to solve for M, not C_{2500}
  # Therefore we formulate a new design matrix in order to solve M - this section of code does this
  
  # We have a complex set of contribution comparisons to make:
  # Rows are the possible gene-cell combinations
  # Columns are the sites-of-origin of counts i.e. col1 - shares no barcodes, col2 - shares one barcode, col3 - same cell
  
  nbeta <- max(beta)
  design <- responses <- vector("list", nrow(mixed))
  for (g in seq_len(nrow(mixed))) {
   collected.i <- collected.j <- collected.x <- vector("list", nsamples)
   for (i in seq_len(nsamples)) {
     collected.i[[i]] <- rep(i, nsamples)
     collected.j[[i]] <- beta[i,]
     collected.x[[i]] <- pure[g,]
   }
   collected <- sparseMatrix(i=unlist(collected.i), j=unlist(collected.j), x=unlist(collected.x))
   #staying with Matrix here leads to an integer overflow of some sort
   #coerce back to regular matrix
   design[[g]] <- as.matrix(collected)
   responses[[g]] <- mixed[g,]
  }
  
  design <- do.call(rbind, design)
  
  #downweight by abundance, to ensure it is not only a few highly expressed genes driving a fit
  if(nrow(mixed)>1){
  gene_means = rowMeans(pure)
  sqrts = sqrt(gene_means)
  repped = unlist(lapply(sqrts, rep, ncol(mixed)))
  
  design = sweep(design, 1, repped, "/")
  responses = Matrix(unlist(responses))/repped
  }
  
  
  if(!constrain){
    res = solve(qr(design), matrix(unlist(responses)))
    out = res[,1]#/sum(res)
  } else {
    #this implementation of lsei prevents fortran overflows but is otherwise identical
    res = my_lsei(A = as.matrix(design), B = unlist(responses), G = diag(ncol(design)), H = numeric(ncol(design)), type = 2)
    out = res$X#/sum(res$X)
  }
  
  
  names(out) = c("other", "rowcol", "self")

  return(out)
}

#utility function to extract barcodes from cell names
get_bcs = function(counts_mat){
  bcs = strsplit(colnames(counts_mat), ".", fixed = T)
  bc1 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[1])
  bc2 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[2])
  return(list(bc1, bc2))
}

bcs = lapply(swap_split, get_bcs)

#get the parameters
const = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 500))
#rbind them together for all the plates
const = do.call(rbind, const)
#multiply up to total contributions
const_adjusted = sweep(const, 2, c(96-8-11, 11+7, 1), "*")
#normalise to one
const_adjusted = sweep(const_adjusted, 1, rowSums(const_adjusted), "/")


run_250 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 250))
run_250 = do.call(rbind, run_250)
adjust_250 = sweep(run_250, 2, c(96-8-11, 11+7, 1), "*")
adjust_250 = sweep(adjust_250, 1, rowSums(adjust_250), "/")


run_1000 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 1000))
run_1000 = do.call(rbind, run_1000)
adjust_1000 = sweep(run_1000, 2, c(96-8-11, 11+7, 1), "*")
adjust_1000 = sweep(adjust_1000, 1, rowSums(adjust_1000), "/")


run_sample = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 1000, sample = 500))
run_sample = do.call(rbind, run_sample)
adjust_sample = sweep(run_sample, 2, c(96-8-11, 11+7, 1), "*")
adjust_sample = sweep(adjust_sample, 1, rowSums(adjust_sample), "/")

Across the 16 plates assayed, we acquired a distribution of fitted values (0 to 0.00279) for the single shared barcode contribution term.

To quantify how much of a cell’s transcriptome derives from all other cells that share exactly one or exactly no barcodes, we multiplied \(\beta\) and \(\gamma\) by 18 and 77, respectively. Here, 18 and 77 correspond to the number of entries in each column (or row) of \(M\) that are \(\beta\) or \(\gamma\), respectively. In other words, they are for each cell the number of other cells on the plate that share exactly one or no barcodes.

We then normalised the estimates to ensure that \(\alpha + \beta + \gamma = 1\), as we wish to examine the fraction of HiSeq 4000 counts that are derived from other cells, which is the swapping rate. Across plates, the mean contributions from the single barcode sharing cells (i.e. \(\beta\)) was estimated to be 2.068 \(\pm\) 0.326% (Figure 17).

Similar results were also obtained with 250 (2.06 \(\pm\) 0.298%) or 1000 (2.04 \(\pm\) 0.372%) genes considered per plate. Similarly, when 500 genes were chosen at random from the top 1000 ranked genes per plate, results were also concordant (2.04 \(\pm\) 0.367%)

hist(const_adjusted[,2], breaks = 20, xlim = c(0, 0.06), main = "", xlab = "Fitted fraction of transcriptome from other barcode-sharing cells")
Fitted values for single-barcode-sharing contributions. The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells. Different fitted values may derive from truly different swapping rates, or differences in the number of informative genes.

Figure 17: Fitted values for single-barcode-sharing contributions
The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells. Different fitted values may derive from truly different swapping rates, or differences in the number of informative genes.

These values provide an estimated increase in swapping rate in the HiSeq 4000 data compared to the HiSeq 2500 data, not an absolute estimate. Nonetheless, these values are consistent with the rate of swapping calculated in the Richard dataset above. Assuming that HiSeq 2500 swapping occurs at a rate of 1/10th the rate of HiSeq 4000 swapping (based on Richard data, and the crosshair patterns shown in Figures 10 and 11), we could estimate an absolute HiSeq 4000 swapping rate in this data by multiplication of the relative HiSeq 4000 estimate by a factor of 1.1. This gives an estimate from the Nestorowa data of 2.275 \(\pm\) 0.359%, which is consistent with the absolute swapping rate estimated from the Richard data of 2.187$$0.0765%.

Contributions for the cells that share no barcodes (i.e. using \(\gamma\)) are shown in Figure 18. Despite there being many more cells that could contribute in this manner, the total contribution is low, and in many cases estimated to be 0.

hist(const_adjusted[,1], breaks = 20, xlim = c(0, 0.06), main = "", xlab = "Fitted fraction of transcriptome from non-barcode-sharing cells")
Fitted values for single-barcode-sharing contributions. The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells.

Figure 18: Fitted values for single-barcode-sharing contributions
The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells.

rate_est = c(adjust_250[,2], 
             adjust_1000[,2], 
             const_adjusted[,2],
             adjust_sample[,2])

genes = c(rep(250, 16),
          rep(1000, 16),
          rep(500, 16),
          rep("500 Sampled", 16))

ggplot(data.frame(genes = genes, rate = rate_est, plate = rep(1:16, 4)), 
       aes(x = factor(plate), y = rate, col = factor(genes))) + 
  geom_point(size = 3) +
  theme_bw() +
  scale_color_brewer(palette = "Set1", name = "Number\nof genes\nconsidered") +
  labs(x = "Plate", y = "Row-column swapping rate estimate")
Swapping rate estimates for different numbers of considered genes. '500 Sampled' refers to 500 genes selected at random from the top 1000 genes. The variance of the estimated swapping rate for each plate for different numbers of considered genes is much smaller than the differences in swapping rate observed between plates.

Figure 19: Swapping rate estimates for different numbers of considered genes
‘500 Sampled’ refers to 500 genes selected at random from the top 1000 genes. The variance of the estimated swapping rate for each plate for different numbers of considered genes is much smaller than the differences in swapping rate observed between plates.

Additionally, the mean value of the gene selection score used (the maximum expression on each plate divided by the 90th percentile) for the top 500 genes (excluding Inf) does not affect the predicted swapping rate, shown in Figure 20. In other words, it is not the absence abundance of sporadically expressed genes that causes estimates of different swapping rates.

plates = 1:length(noswap_split)
gene_vars = lapply(plates, function(i)    
  apply(noswap_split[[i]][rowMax(noswap_split[[i]])>500,], 1,  
        function(x) max(x)/quantile(x, 0.9)))
gene_vars = lapply(gene_vars, function(x) x[order(x, decreasing = TRUE)])
gene_vars = lapply(gene_vars, function(x) x[x!=Inf])

means = sapply(gene_vars, mean)

ggplot(data.frame(mean = means, rate = const_adjusted[,2]),
       aes(x = mean, y = rate)) +
  geom_point() +
  theme_bw() +
  labs(x = "Mean gene score for top 500 genes (no Inf)",
       y = "Row-column swapping rate estimate")
Swapping rate estimation is independent of the selected gene influence. The mean gene score (the maximum expression on each plate divided by the 90th percentile) for the first 500 genes on each plate (excluding Inf) does not affect the estimated swapping rate.

Figure 20: Swapping rate estimation is independent of the selected gene influence
The mean gene score (the maximum expression on each plate divided by the 90th percentile) for the first 500 genes on each plate (excluding Inf) does not affect the estimated swapping rate.

3.3.1 Linking swapping rate to library characteristics

Sinha et al. (2017) increased the swapping rate by titrating larger and larger amounts of free barcode into their sequencing libraries. However, this does not explain by itself the diversity of swapping rates that we observe for conventionally designed sequencing libraries.

To understand swapping in this data, we have recorded the molarity of the library at different molecule lengths from the bioanalyzer software package. In particular, we have considered library barcodes as lengths 40-75bp, and sequenced cDNA as lengths 400-800bp. A screenshot from bioanalyzer’s Expert software is shown in Figure 21.

knitr::include_graphics(paste0(folder_location, "/figs/bioanalyzer.png"))
Bioanalyzer screenshot. The range 40-75bp corresponds to free DNA barcode, while cDNA lengths of 400-800bp are those that are expected to be sequenced in HiSeq 4000.

Figure 21: Bioanalyzer screenshot
The range 40-75bp corresponds to free DNA barcode, while cDNA lengths of 400-800bp are those that are expected to be sequenced in HiSeq 4000.

The molarities of each sample are shown in Figure 22, overlaid with the calculated swapping rates. The swapping rates shown here include single and double barcode swaps.

load_bio_csv = function(filepath){
  input = read.table(filepath, sep = ",", header = FALSE)
  #test for ranges
  from_col = which(input[1,] == "From [bp]")
  test1 = all(c(40, 400) %in% input[-1, from_col])
  to_col = which(input[1,] == "To [bp]")
  test2 = all(c(75, 800) %in% input[-1, to_col])

  if(!(test1 & test2))
    stop("You don't appear to have the correct ranges specified (40-75; 400-800)")

  #keep important columns
  cols = c("Average Size [bp]" = "avg_size_bp",
           "Conc. [pg/\xb5l]" = "conc_pgul",
           "From [bp]" = "from_bp",
           "To [bp]" = "to_bp",
           "Molarity [pmol/l]" = "molarity_pmoll")
  
  input = input[, match(names(cols), as.character(unlist(input[1,])))]

  names(input) = cols
  input = input[-1,]

  for(col in 1:ncol(input)){
    input[,col] = gsub(",", "", input[,col])
    input[,col] = as.numeric(as.character(input[,col]))
  }
  
  return(input)

}

#recall that we skip the last 4 plates as these were re-pooled
dfs = lapply(1:16, function(i) load_bio_csv(paste0(folder_location, "/data/bioanalyzer/sample", i, ".csv")))
names(dfs) = plate_names[1:16]

plot_df = as.data.frame(t(sapply(1:16, function(i) c(bc_mol = dfs[[i]][1, "molarity_pmoll"],
                                                     seq_mol = dfs[[i]][2, "molarity_pmoll"],
                                                     swap_rate = 1-as.numeric(const_adjusted[i, 3])))))

conc_df = plot_df

ggplot(plot_df, aes (x = bc_mol, y= seq_mol, col = swap_rate, size = swap_rate)) +
  geom_point() +
  scale_color_viridis(name = "Swap Rate") +
  theme_bw() +
  scale_size_continuous(name = "Swap Rate") +
  labs(x = "Barcode molarity (pmol/L)", y = "Sequenced cDNA molarity (pmol/L)")
bioanalyzer-calculated molarites and swapping rates

Figure 22: bioanalyzer-calculated molarites and swapping rates

There does not appear to be any consistent pattern across these measures. In Figure 23, the swapping rates are plotted directly against the molarity of free barcode.

ggplot(plot_df, aes(x = bc_mol, y= swap_rate)) +
  geom_point() +
  theme_bw() +
  labs(x = "Barcode molarity (pmol/L)", y = "Calculated swapping rate")
Swapping rate is not driven by free barcode molarity

Figure 23: Swapping rate is not driven by free barcode molarity

model = lm(data = plot_df, formula = swap_rate ~ I(bc_mol))

Notably, there is no obvious correlation (i.e. the gradent in a OLS linear model is not different from 0, p=0.427).

Similar behaviour is seen for the ratio of free barcode to captured cDNA, as shown in Figure 24.

ggplot(plot_df, aes(x = bc_mol/seq_mol, y= swap_rate)) +
  geom_point() +
  theme_bw() +
  labs(x = "Barcode molarity / Sequenced cDNA molarity", y = "Calculated swapping rate")
Swapping rate is not driven by excess of free barcode compared to sequenced cDNA

Figure 24: Swapping rate is not driven by excess of free barcode compared to sequenced cDNA

model = lm(data = plot_df, formula = swap_rate ~ I(bc_mol/seq_mol))

model_outlier = lm(data = plot_df[-which.max(plot_df$bc_mol/plot_df$seq_mol),], formula = swap_rate ~ I(bc_mol/seq_mol))

Linear models applied here also do not show a slope significantly different from 0 (p=0.129, p=0.466 with the high-ratio outlier removed from extreme right of plot).

The molarity calculations may also contain barcode concatamers, which will not align to the mouse genome and will therefore not be considered in our estimations of swapping rates. In Figure 25, we repeat the same ratio approach as in Figure 24, but use the total number of mapped reads per plate as a proxy for the amount of sequenced cDNA.

seq_proxy = sapply(1:16, function(x) sum(swap_split[[x]]))


ggplot(plot_df, aes(x = bc_mol/seq_proxy, y= swap_rate)) +
  geom_point() +
  theme_bw() +
  labs(x = "Barcode molarity / Total aligned reads", y = "Calculated swapping rate")
Swapping rate is not driven by excess of free barcode compared to sequenced cDNA. Here, the total number of alignable reads per plate has been considered to exclude barcode concatamers, which may be seen on the bioanalyzer trace.

Figure 25: Swapping rate is not driven by excess of free barcode compared to sequenced cDNA
Here, the total number of alignable reads per plate has been considered to exclude barcode concatamers, which may be seen on the bioanalyzer trace.

model = lm(data = plot_df, formula = swap_rate ~ I(bc_mol/seq_proxy))

However, we still do not see a slope signifiacntly different from 0 (p=0.452)

In summary, we do not observe our calculated swapping rates to be affected by the amount of free barcode in libraries, nor the ratio of free barcode to cDNA quantity. This contrasts with Sinha et al. (2017), who showed that titrations of increasing amounts of additional free primer (1nM, 10nM, 100nM) increased the rate of barcode swapping.

However, their most impressive result was that with 100nM of free barcode, which is far in excess of standard experimental quantities. Our analysis therefore suggests that the amount of free barcode has little bearing on the rate of swapping, when the amount of free barcode is at standard experimental levels - for example, in this data we observed concentrations of around 2-6nM.

3.4 Testing for transcriptome-wide swapping

In the crosshair plots shown in Figures 10 and 11, we showed a swapping pattern for a single gene for a single plate. We apply a model to each gene across all plates to identify whether swapping is happening transcriptome-wide. This differs from the previous model, which considered many genes together.

Let \(i\) denote the row index for a cell on the 96 well plate (\(1 \leq i \leq 8\)) and let \(j\) denote a cell’s column index (\(1 \leq j \leq 12\)). Subsequently, let \(C_{i,j,g}^{(t)}\) denote the read count of gene \(g\) in cell \((i,j)\) when profiled using sequencing technology \(t\), where \(t=2500\) refers to the HiSeq 2500 and \(t=4000\) refers to the HiSeq 4000. Additionally, let \[S_{i,j,g}^{(t)} = \sum_{k=1}^{12} C_{i,k,g}^{(t)} + \sum_{l=1}^{8} C_{l,j,g}^{(t)} - 2 C_{i,j,g}^{(t)}\] represent the total read count of gene \(g\) in technology \(t\) across all cells with index \(i\) or \(j\), excluding the cell with barcode combination \((i,j)\).

We first consider a null model without swapping where, for cell \((i,j)\), the number of reads mapped to each gene \(g\) in the HiSeq 4000 data set is only dependent on the number of reads mapped to the same cell in the HiSeq 2500 data. More specifically:

\[H_{0}: C_{i,j,g}^{4000} = \alpha_{g} C_{i,j,g}^{2500} + \epsilon_{i,j,g}\]

\(\alpha_g\) captures both read depth differences as well as any gene specific biases (e.g. GC content) that may change between the two sequencing machines.

As an alternative, we also consider an alternative model with a swapping term, where each cell in the HiSeq 4000 data receives and loses reads from swapping with cells that share exactly one barcode with it. Specifically, let:

\[H_{1}: C_{i,j,g}^{4000} = \alpha_g C_{i,j,g}^{2500} + \beta_g \left( S_{i,j,g}^{2500} - C_{i,j,g}^{2500} \right) + \epsilon_{i,j,g}\]

\(\beta_g\) allows the strength of the swapping term to vary between genes, so that we can identify whether all genes are affected by swapping or not. \(S_{i,j,g}^{2500}\) accounts for arrival of transcripts of gene \(g\) in a cell \((i,j)\) due to barcode swapping. Similarly, \(C_{i,j,g}^{2500}\) accounts for the loss of transcripts of gene \(g\) in cell \((i,j)\) due to barcode swapping (as without a barcode swap, a PCR duplicate would instead represent a new read for the gene in the original cell).

The model may also be parameterised as \[H_{1}: C_{i,j,g}^{4000} = \left(\alpha_g - \beta_g\right) C_{i,j,g}^{2500} + \beta_g S_{i,j,g}^{2500} + \epsilon_{i,j,g} = \alpha'_g C_{i,j,g}^{2500} + \beta_g S_{i,j,g}^{2500} + \epsilon_{i,j,g}\] but as the models are equivalent we fit the more swapping intuitive formulation presented above (i.e. including gain and loss of trancripts).

Models \(H_1\) and \(H_0\) are fitted by least squares, approximating the Poisson-distributed counts as normally distributed. We fitted to genes that are abundant (mean >50 counts across all plates) as swapping can only occur where there are sufficient molecules present to swap in the first place.

For each gene, evidence against \(H_{0}\) in favour of \(H_{1}\) can be established by performing an F-test (as the models are nested).

When \(H_{1}\) is favoured over \(H_{0}\), we are explaining the data better with the swapping term. However, the swapping term only operates in the way that we expect when \(\beta > 0\). If \(\beta < 0\), then the behaviour of swapping is inverted compared to that expected from the other work presented above.

When the model is applied, the vast majority of genes favour \(H_{1}\), with \(\beta>0\), as shown in Figure 26

#This chunk is very slow and computationally intensive. Change the min.mean setting in the line starting
# "gene_dfs = " if you need to speed it up.

#Note that these functions rely on the barcodes being in the names of the cells in a specific place

#returns the summed counts for all cells that share an index from the cell name, in the counts.
#char cell: cell name
#named vector named_count_row: a gene's row from the counts matrix, with cell names
get_summed = function(cell, named_count_row){
  barcodes = strsplit(strsplit(cell, split = ".", fixed = TRUE)[[1]][3],
                      split = "_")[[1]]
  row_bc = barcodes[2]
  col_bc = barcodes[1]
  
  #provides S_{gij} for the simplified model fit
  # summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 2* named_count_row[names(named_count_row) == cell]
  
  #provides S_{gij} - C_{gij} for the more complex model fit
  #-2C for the non-self-counting, -1C for the term to fit beta to
  summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 3* named_count_row[names(named_count_row) == cell]
  
  return(summed)
}

#given a row of counts for each of the 2500 and 4000 (i.e. a gene), function prepares
#a data frame of cell, count_4000 = 4000 counts, count_2500 = 2500 counts, summed_2500 = row/col summed counts
# i.e. wraps get_summed() across cells
process_gene_on_plate = function(count_row_2500, count_row_4000){
  
  summed = sapply(1:length(count_row_2500), function(x) get_summed(cell = names(count_row_2500[x]),
                                                                   named_count_row = count_row_2500))
  
  out = data.frame(cell = names(count_row_2500),
                   count_4000 = count_row_4000,
                   count_2500 = count_row_2500,
                   summed_2500 = summed)
  
  return(out)
}


#Given plate-specific data frames, returns a data frame for each gene therein with,
# for each cell, the values of the 2500 and 4000 counts, and the summed 2500 row/column
#i.e. wraps process_gene_on_plate() across genes
process_plate = function(plate_meta, plate_4000, plate_2500){
  if(any(rownames(plate_2500)!= rownames(plate_4000)))
    stop("Genes not the same on both plates")
  
  
  genes_4000 = split(as.data.frame(plate_4000), rownames(plate_4000))
  genes_4000 = lapply(genes_4000, function(x) unlist(x[1,]))
  
  genes_2500 = split(as.data.frame(plate_2500), rownames(plate_2500))
  genes_2500 = lapply(genes_2500, function(x) unlist(x[1,]))
  
  gene_dfs = lapply(1:length(genes_2500), function(x) 
    process_gene_on_plate(count_row_2500 = genes_2500[[x]], 
                          count_row_4000 = genes_4000[[x]])
    )
  names(gene_dfs) = names(genes_2500)
  
  return(gene_dfs)

}

#The function takes the full counts matrices, splits them by plate, and returns the
#dataframes ready to be modelled from, in a list
#all_meta needs $plate
#i.e. wraps process_plate() across all plates
do_all_plates = function(counts_4000, counts_2500, all_meta, min.mean = 50, n.cores = 3){
  require(parallel)
  
  keep_genes = which(rowMeans(counts_2500)>min.mean)
  
  split_meta = split(all_meta, all_meta$plate)
  split_4000 = lapply(split(as.data.frame(t(counts_4000[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
  split_2500 = lapply(split(as.data.frame(t(counts_2500[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
  
  #run over all plates
  #Critically, this returns the column summed_2500, which is defined by get_summed()
  # and will later be the predictor
  dataframes = mclapply(seq_along(split_4000), function(x) process_plate(plate_meta = split_meta[[x]],
                                                                        plate_4000 = split_4000[[x]],
                                                                        plate_2500 = split_2500[[x]]),
                        mc.cores = n.cores)
  
  
  #for each gene,make a master data frame
  n.genes = length(dataframes[[1]])
  gene_list = list()
  for(gene in 1:n.genes){
    
    #get the combined dataframe of the gene
    gene_df = data.frame()
    for(plate in 1:length(dataframes)){
      gene_df = rbind(gene_df, dataframes[[plate]][[gene]])
    }
    
    gene_df$response = gene_df$count_4000
    gene_df$predictor = gene_df$summed_2500
    #sometimes we cannot get a predictor variable i.e. there is not any possibiltiy for swapping when
    #the row/col sum= 0 (i.e. the gene does not exist for these cells)
    #so we remove these now
    # gene_df = gene_df[!is.nan(gene_df$predictor),]
    
    gene_list[[length(gene_list)+1]] = gene_df
    names(gene_list)[[length(gene_list)]] = names(dataframes[[1]])[gene]
  }
  
  return(gene_list)
  
}

#This function just makes models from its input list - from do_all_plates()
get_model_from_df_list = function(df_list){

  models = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
  
  return(models)
  
}

#takes the same list as get_model_from_df_list() but does the model comparison
get_anovas = function(df_list){
  full_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
  small_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500))
  
  return(lapply(1:length(df_list), function(i) anova(small_mod[[i]], full_mod[[i]])))
}


#If you want to speed up this code, increase the min.mean to 500 or even higher, so fewer genes are considered
gene_dfs = do_all_plates(counts_4000 = swap, counts_2500 = noswap, all_meta = blood_meta, min.mean = 50, n.cores = ncores)

#for betas
models = get_model_from_df_list(gene_dfs)

anovas = get_anovas(gene_dfs)
#an alternative method vs. the Wald test
anova_ps = sapply(anovas, function(x) x$`Pr(>F)`[2])
anova_fdr = p.adjust(anova_ps, method = "fdr")
anova_frac = prop.table(table(p.adjust(anova_ps, method = "fdr")<0.05))[2]


gradients = sapply(models, function(x) coef(x)[2])

plot_df = data.frame(val = c("pos", "pos", "neg", "neg"), 
                     sig = c(TRUE, FALSE, TRUE, FALSE),
                     values = c(sum(gradients>0 & anova_fdr<0.05),
                                sum(gradients>0 & anova_fdr>0.05),
                                sum(gradients<0 & anova_fdr<0.05),
                                sum(gradients<0 & anova_fdr>0.05)))

ggplot(plot_df, aes(x = factor(sig, levels = c(TRUE, FALSE), ordered = TRUE), y = values)) +
  geom_bar(aes(fill = factor(val, levels = c("pos", "neg"), ordered = TRUE)), position = "dodge", stat = "identity") +
  scale_x_discrete(labels = c("Model fit improvement\nwith swapping term",
                              "No model fit improvement\nwith swapping term"),
                   name = "") +
  scale_y_continuous(name = "Number of genes") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(), 
        #panel.grid.minor.y = element_blank(), 
        axis.text = element_text(face = "bold", size = 10), 
        axis.title = element_text(size = 13, face = "bold")) +
  scale_fill_manual(values = c("#23809C", "#7A1305"), name = "", labels = c("Coefficient sign supports swapping\n(beta > 0)", "Coefficient sign opposes swapping\n(beta < 0)"))
Gene-wise swapping model fits. Gene-wise model fits are shown, separated by whether the swapping model is preferred over the null model. Y-axis values correspond to the number of genes in each category.

Figure 26: Gene-wise swapping model fits
Gene-wise model fits are shown, separated by whether the swapping model is preferred over the null model. Y-axis values correspond to the number of genes in each category.

support_frac = sum(gradients > 0 & anova_fdr < 0.05)/sum(anova_fdr < 0.05)
counts_lanes = as.matrix(read.csv(paste0(folder_location, "/data/wilson_2500.csv")))
counts_lanes = counts_lanes[rowSums(counts_lanes)>0,]
#split to the two sub_matrices
counts_1 = counts_lanes[,!grepl("_new", colnames(counts_lanes))]
counts_2 = counts_lanes[,grepl("_new", colnames(counts_lanes))]
colnames(counts_2) = gsub("_new", "", colnames(counts_2))

bc1s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[1])
bc2s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[2])


#for the two lane 2500
#match the name format to the other plates just for this
lane_1 = counts_1; colnames(lane_1) = paste0("some.name.", colnames(lane_1))
lane_2 = counts_2; colnames(lane_2) = paste0("some.name.", colnames(lane_2))
twolane_meta = data.frame(cell = c(colnames(lane_1)), plate = c(rep(1, ncol(lane_1))))
gene_dfs_twolane = do_all_plates(counts_4000 = lane_1, counts_2500 = lane_2, all_meta = twolane_meta, min.mean = 50, n.cores = 3)
models_twolane = get_model_from_df_list(gene_dfs_twolane)


anovas_twolane = get_anovas(gene_dfs_twolane)
anova_ps_twolane = sapply(anovas_twolane, function(x) x$`Pr(>F)`[2])
anova_frac_twolane = prop.table(table(p.adjust(anova_ps_twolane, method = "fdr")<0.05))[2]

90.5% of considered genes show a significant model fit improvement with the swapping term (FDR-corrected \(q<0.05\)). Of these genes, 97.9% have a positive value of \(\beta\), supporting the barcode swapping model.

This strongly suggests that barcode swapping is more prevalent on the HiSeq 4000 compared to the HiSeq 2500, and that it affects nearly all genes.

As a control for our model, we also applied it to data where identical library pools were sequenced on two lanes of a HiSeq 2500. Here, only 0.099% of genes have a significantly improved fit with the \(\beta\) term, as expected.

4 Droplet-based analyses

4.1 Introduction to method

New single-cell RNAseq protocols utilise microfluidic systems to automate stages of library preparation by capturing individual cells in droplets. Each run of the microfluidic system generates a sample, which typically contains thousands of cells. These droplet-based protocols label their cells in a different manner to plate based assays, as a cell barcode (unique to each droplet) is incorporated into the transcript alongside an additional Illumina barcode that labels different sets of cells, or samples. Therefore, swapping of Illumina barcodes will move transcripts between samples, while retaining the same cell identifier (Figure 27).

knitr::include_graphics(paste0(folder_location, "/figs/10x_bcs.pdf"))
10X barcode schematic. A 10X transcript contains multiple barcodes, including a 10X-supplied cell-labelling barcode, a randomly generated Unique Molecular Identifier (UMI), and an Illumina-supplied sample index. Only the sample index is expected to swap, leaving the cell barcode and UMI unchanged.

Figure 27: 10X barcode schematic
A 10X transcript contains multiple barcodes, including a 10X-supplied cell-labelling barcode, a randomly generated Unique Molecular Identifier (UMI), and an Illumina-supplied sample index. Only the sample index is expected to swap, leaving the cell barcode and UMI unchanged.

As discussed in the main text, swapping can have two effects depending on whether the cell barcodes are shared between the donor and recipient libraries. If they are shared, transcriptomes are homogenised. If they are not, artefactual cells may appear. Here, we investigate whether the barcode swapping phenomenon causes an excess of cell barcode sharing in droplet data.

The data we used were generated using the 10X Chromium system, where the manufacturers report that there are around 750,000 unique cell barcodes (Zheng et al. 2017). We find that there are 737,280 columns in the raw matrix output of CellRanger, which we take to be the total number of barcodes. If we make the assumption that cell barcodes are drawn at random for each sample, this allows us to formulate a null model for the rate of sharing.

We can test for deviations in the observed rate of sharing against the null using a hypergeometric test. This test uses the hypergeometric distribution, which describes the expeected number of “successes” for drawing without replacement. For our data, we apply sample pairwise testing, as particular samples may be more compromised by swapping than others.

For sample 1, we have \(n_1\) barcodes that have been called as cells, and for sample 2 \(n_2\) barcodes that have been called as cells. The 737,280 cell barcodes supplied by 10X is referred to as \(N\). The number of shared barcodes is \(S\).

We consider the sampling of sample 2’s \(n_2\) barcodes as repeated drawing from the \(N\) set of barcodes, of which the drawing of \(n_1\) particular barcodes (those from sample 1) consitute a success.

Under the null hypothesis, cell barcodes are drawn at random, and there is no increased incidence of barcode sharing. In this case, the rate of sharing of barcodes between samples should follow the hypergeometric distribution for each sample’s values of \(n\). We test the observed rate of sharing \(S\) against the distribution i.e. how often would we expect to see a result as severe or more severe (i.e. equally many or more shared barcodes) given random drawing of barcodes from the pool?

#thanks to Karsten Bach for code
# Computation of p-values based on hypergeometric test
compare <- function(barcodes, samples) {
    out <- NULL
    samp_names = unique(samples)
    samples <- as.numeric(as.factor(samples))
    ids <- levels(factor(samples, levels = seq_along(unique(samples))))
    combs <- combn(ids,m=2, simplify=FALSE)
    for (i in seq_along(combs)) {
    comb <- combs[[i]]
    s1 <- comb[1]
    s2 <- comb[2]
    bc1 <- as.character(barcodes[samples==s1])
    bc2 <- as.character(barcodes[samples==s2])
    x <- length(intersect(bc1,bc2))
    m <- length(bc1)
    n <- 737280 #taken from ncol(raw_matrix.mtx) #old value used: 750000
    k <- length(bc2)
    p.val <- phyper(q=x-1,m=m,n=n-m,k=k,lower.tail=FALSE) #old: n = n
    tmp <- data.frame(s1=s1,
              s2=s2,
              n1=m,
              n2=k,
              shared=x,
              p=p.val)
    out <- rbind(out,tmp)
    }
    
    out$s1 = samp_names[as.numeric(as.character(out$s1))]
    out$s2 = samp_names[as.numeric(as.character(out$s2))]
    return(out)
}

4.2 HiSeq 2500 cell barcode sharing

4.2.1 Dataset 1

drop_2500 = read.table(paste0(folder_location, "/data/tenx_2500_1.tab"), stringsAsFactors = FALSE)
split = strsplit(drop_2500[,1], "-")
barcodes = sapply(split, function(x) x[1])
samples = sapply(split, function(x) x[2])

Here, droplet data was generated from mouse embryonic cells using the 10X Genomics Chromium system. Samples were sequenced on lanes of a HiSeq 2500. 11 samples were multiplexed, varying in size between 3776 and 313 called cells. CellRanger 1.3.1 was used for sample processing with default arguments. Figure 28 shows p-values from the hypergeometric tests between every pair of samples.

# good_comp = pairwise_comparisons(barcodes = barcodes, samples = samples, n.cores = 3)
good_comp = compare(barcodes, samples)
hist(good_comp$p, xlab = "p", main = "HiSeq 2500, embryonic cells", col = "darkgrey")
HiSeq 2500 sharing p-values. Hypergeometric test p-values between samples are shown for the HiSeq 2500, embryonic cell dataset.

Figure 28: HiSeq 2500 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 2500, embryonic cell dataset.

if(any(p.adjust(good_comp$p, method = "fdr")!=1))
  print("There are non-1 adjusted p-values")

None of the p-values for any comparison is significant after FDR correction at any significance level (all \(q=1\)). The high density of \(p=1\) values arise when two samples share no barcodes. Therefore we do not observe any excess of sharing in this data.

4.2.2 Dataset 2

samp2_2500 = read.table(paste0(folder_location, "/data/tenx_2500_2.tab"), header = TRUE)

result = compare(samp2_2500$barcode[!grepl("B", samp2_2500$SampleID)], samp2_2500$SampleID[!grepl("B", samp2_2500$SampleID)])

We now apply the same set of pairwise comparisons to a dataset of mouse epithelial cells, once again processed using the 10X Chromium system, and sequenced on a HiSeq 4000. CellRanger 1.2.1 was used for data processing with default arguments. The six samples range in size between 1102 and 135 called cells. Two samples were excluded from analysis here due to failed library preparation. Only 2 barcodes were shared between samples, so p-values are almost universally 1, as shown in Figure 29.

hist(result$p, main = "HiSeq 2500, mouse epithelial cells", xlab = "p-value", breaks = 20, xlim = c(0,1), col = "darkgrey")
HiSeq 2500 sharing p-values. Hypergeometric test p-values between samples are shown for the HiSeq 2500, mouse epithelial cell dataset.

Figure 29: HiSeq 2500 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 2500, mouse epithelial cell dataset.

4.3 HiSeq 4000 cell barcode sharing

4.3.1 Dataset 2

samp2_all = read.table(paste0(folder_location, "/data/tenx_4000_2.csv"), stringsAsFactors = FALSE, sep = ",", header = T)
samp2_all$barcode = sapply(strsplit(samp2_all$barcode,"-"), function(x) x[1])
samp2_all$quantile = NA
for(i in 1:nrow(samp2_all)){
  samp = samp2_all$SampleID[i]
  rem = samp2_all[-i,]
  umis = rem$UmiSums[which(rem$SampleID == samp)]
  samp2_all$quantile[i] = sum(umis<samp2_all$UmiSums[i])/length(umis)
}

samp2 = samp2_all[!samp2_all$SampleID%in%c("B1", "B2"),]

bad_comp = compare(samp2$barcode, as.numeric(factor(samp2$SampleID)))

The same mouse epithelial cell dataset (Dataset 2, above) was also sequenced on HiSeq 4000. In this data, the six samples range in size between 1111 and 151 called cells. Two samples were again excluded from analysis here due to failed library preparation. Here we observe low p-values in all pairwise comparisons (Figure 30).

hist(bad_comp$p, xlab = "p", xlim =c(0,0.01), breaks = 10, main = "HiSeq 4000, mouse epithelial cells", col = "darkgrey")
HiSeq 4000 sharing p-values. Hypergeometric test p-values between samples are shown for the HiSeq 4000, mouse epithelial cell dataset. Note the x-axis scale, which differs from the other histograms.

Figure 30: HiSeq 4000 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 4000, mouse epithelial cell dataset. Note the x-axis scale, which differs from the other histograms.

Despite this, the absolute rate of barcode sharing is still low: of 2950 barcodes, only 10 barcodes occur more than once

4.3.2 Dataset 3

samp3 = read.table(paste0(folder_location, "/data/tenx_4000_3.tab"), header = T)
samp3_comp = compare(samp3$barcode, samp3$sample)

sig_row = which.min(samp3_comp$p)

This data was also generated using the 10X Genomics Chromium system, with four samples of human xenograft cells varying in size between 4608 and 1462 called cells. Libraries were sequenced on a HiSeq 4000, and data was processed using CellRanger 1.3.1 using default arguments.

Of the 9621 barcodes, 16 barcodes were observed twice, with none observed three times or more. Our pairwise comparison method identifies one significant excess of sharing after FDR correction (q = 0.0594). Here, 9 barcodes were shared compared to an expected value of 3.45 for samples of size 1771 and 1462.

Figure 31 shows p-values from the hypergeometric tests between every pair of samples.

hist(samp3_comp$p, main = "HiSeq 4000, xenograft cells", xlab = "p-value", breaks = 20, col = "darkgrey")
HiSeq 4000 sharing p-values. Hypergeometric test p-values between samples are shown for the HiSeq 4000, xenograft cell dataset.

Figure 31: HiSeq 4000 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 4000, xenograft cell dataset.

In both HiSeq 4000 datasets, we observe excess barcode sharing between samples. However, the actual number of shared cell barcodes remains low. We hypothesise that, due to the low rate of barcode swapping, swapped-in cell libraries (the potentially artefactual libraries) are very small compared to the libraries of real cells. Because the libraries are small, they are typically ignored by algorithms that call cellw, as these often depend on library sizes of individual cell barcode libraries - small libraries do not look like real cells. It is therefore plausible that droplet data has an intrinsic robustness to the generation of artefactual cells, at least when samples contain cells of comparable size, and are of high quality.

4.4 Barcode swapping can create “shadow-cells” in compromised samples

This section describes an experiment where barcode-swapping induced severe technical artefacts which may confound analysis.

In the second droplet dataset above (mouse epithelial cells), sample labelling combines experimental condition (A-D) with biological replicate number (1-2). Samples B1 and B2 possess considerably smaller library sizes (Figure 32), and share almost all of their cell barcodes with each other and with other samples that were multiplexed with them (Figure 33). Accordingly, samples B1 and B2 were excluded from the earlier analysis (Figure 30).

bc_tab = table(samp2_all$barcode)


ggplot(samp2_all, aes(x = factor(SampleID), y = UmiSums, fill = substr(SampleID, 1, 1))) +
  geom_violin() +
  scale_y_log10(labels = c("1,000", "10,000", "100,000"), breaks = 10^(3:5)) +
  theme_bw() +
  theme(legend.position = "none", panel.grid.minor.x = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.y = element_blank(), 
        axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15)) +
  labs(x = "", y = "Library size") +
  scale_fill_brewer(palette = "Set2")
Sample library sizes. Samples B1 and B2 have considerably smaller library sizes than other samples.

Figure 32: Sample library sizes
Samples B1 and B2 have considerably smaller library sizes than other samples.

samp2_all$n = sapply(samp2_all$barcode, function(x) bc_tab[x])
samp2_all$n_cut =  samp2_all$n>1 +1

b1_barcodes = samp2_all$barcode[samp2_all$SampleID =="B1"]
b2_barcodes = samp2_all$barcode[samp2_all$SampleID =="B2"]

samp2_all$in_b = samp2_all$barcode %in% b1_barcodes & samp2_all$barcode %in% b2_barcodes
ggplot(samp2_all, aes(x = SampleID)) +
  geom_bar(aes(fill = factor(in_b)), stat = "count") +
scale_fill_manual(values = c("#E8CF76", "#383F70"), name = "", labels = c("Barcode not present\nin both B samples", "Barcode present\nin both B samples")) +  
  theme_bw() + theme(panel.grid.minor.x = element_blank(), 
                     panel.grid.major.x = element_blank(), 
                     panel.grid.minor.y = element_blank(),
                     axis.text.y = element_text(size = 10), 
                     axis.text.x = element_text(size = 12, face = "bold"), 
                     axis.title = element_text(face = "bold", size = 15)) +
  labs(x = "Sample", y = "Number of called cells") +
  scale_y_continuous(breaks = seq(0, 1250, 250), labels = c("0", "250", "500", "750", "1,000", "1,250"))
Barcode sharing between samples. Samples B1 and B2 share most of their barcodes, and many barcodes observed in other samples are also observed in samples B1 and B2.

Figure 33: Barcode sharing between samples
Samples B1 and B2 share most of their barcodes, and many barcodes observed in other samples are also observed in samples B1 and B2.

sample_share = length(intersect(samp2_all$barcode[samp2_all$SampleID=="B1"],
                                samp2_all$barcode[samp2_all$SampleID=="B2"]))/
               length(union(samp2_all$barcode[samp2_all$SampleID=="B1"],
                            samp2_all$barcode[samp2_all$SampleID=="B2"]))



fail_bc = samp2_all[grepl("B", samp2_all$SampleID), "barcode"]

samp2$swapped_in = samp2$barcode%in%fail_bc

test = wilcox.test(samp2$UmiSums[samp2$swapped_in], samp2$UmiSums[!samp2$swapped_in])

Samples B1 and B2 share 94% of all their cell barcodes with each other, and yet apparently contain more cells than any of the other samples. This behaviour is clearly odd. What we believe has happened is that the cells in samples B1 and B2 lysed either prior to or during library preparation in the 10X Chromium system. This has resulted in the very small library sizes observed for cells in these samples: the actual amount of input was very low. Because the real cell libraries in the samples were very small, the transcripts swapped from other samples formed libraries that were large relative to the average library size in the B samples. These swapped libraries were therefore considered to be cells, despite being derived only from other samples. This is supported by the excessive sharing of barcodes between the two B samples.

Why have these small libraries been called as cells? This has been driven by CellRanger’s cell calling algorithm: it calls cells by “selecting 10X barcodes with a total UMI count of >= 10% of the 99th percentile of the expected recovered cells”. Where the 99th percentile happens to be low, many cells of low library size will be called. As a result, CellRanger calls cells regardless of whether libraries are either noisy, of poor quality, or derived from swapping.

Finally, note that not all barcodes from high-quality samples are observed in the poor quality B samples. What may drive a barcode’s presence or absence? Consider differences in library size: a large barcode library has more total cDNA in the sequencer, and will therefore contribute more swapping overall into other samples. A smaller library will contribute more overall. Therefore, when calling cells based on size, we would be more likely to call the large library as a cell, because it is larger.

Working in reverse, barcodes from high-quality samples that are also called as cells in the B samples should therefore have larger libraries than other barcodes who are present in only the high-quality samples. If this is the case, it confirms that the cells called in B are largely derived from swapping artefacts. Indeed, as shown in Figure @ref{fig:smoking-gun}, this is correct. The shared-barcode libraries do have significantly more molecules (p = 9.96e-259, with a ratio of 2.2 between the medians).

ggplot(samp2, aes(x=swapped_in, y = UmiSums)) +
  geom_boxplot() +
  theme_bw() +
  scale_y_log10(breaks = c(1000, 10000, 100000), labels = c("1,000", "10,000", "100,000"), name = "Library size", lim = c(min(samp2$UmiSums)*0.8, max(samp2$UmiSums)*1.5)) +
  scale_x_discrete(labels = c("Unique cell\nbarcode", "Shared cell\nbarcode"), name = "") +
  geom_signif(comparisons = list(c(1, 2)), annotations = "***") +
  theme(panel.grid.major.x = element_blank(), axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15))
Library sizes of shared and unshared cell barcodes, from high-quality samples. Cells that have their barcode shared between a high quality and poor quality sample ("Shared cell barcode") are larger than those that do not ("Unique cell barcode"). This suggests that larger cells have swapped into the poor quality samples on account of their larger size - they can contribute larger absolute amounts of RNA.

Figure 34: Library sizes of shared and unshared cell barcodes, from high-quality samples
Cells that have their barcode shared between a high quality and poor quality sample (“Shared cell barcode”) are larger than those that do not (“Unique cell barcode”). This suggests that larger cells have swapped into the poor quality samples on account of their larger size - they can contribute larger absolute amounts of RNA.

This effect has been observed in data that already faced severely compromised samples (i.e. lysed cells). However, it is possible that the multiplexing of cells of radically different RNA content could present similar problems. We advise caution when combining experimental designs of this nature with HiSeq 4000 (or other patterned flow-cell) sequencing.

5 Removing barcode-swapped artefacts from droplet data

5.1 Cell exclusion

We recommend using a test for the degree of barcode sharing between samples as a standard part of droplet-based single-cell sequencing quality control. The method we have presented above (hypergeometric test on pairwise comparisons) is quick and easy to apply.

If barcode swapping problems are observed, the easiest approach to remove the effects is to remove any cells labelled with a barcode that exists in more than one sample in each multiplexed sequencing run. This procedure will remove the artefacts, regardless of whether the cells truly exist in both samples (which will cause transcriptome mixing) or whether artefactual “fake cells” are being created.

The cost of such a procedure will be the loss of cells for downstream analysis, even in the absence of swapping. However, in a conventionally sized experiment (<4,000 cells per sample), the expected rate of removal is low.

To understand the effects of excluding cell barcodes on scRNA-seq data, we have run simulations. Specifically, we consider a simulated experiment where 8 samples are multiplexed together. Each sample consists of an equal number of cells \(n\), whose barcodes are drawn at random without replacement from 10X’s set of 737,280. Barcodes are then marked for exclusion if they are observed in more than one of the eight samples. Figure 35 shows the rate of removal according to this procedure for different values of \(n\), assuming no additional creation of “fake cells” (i.e. drawing of barcodes entirely at random, swap-free). For each \(n\), 100,000 simulations were run. Clearly, such a removal method will not be suitable for datasets with many cells per sample, or very numerous samples.

do_sim_remove = function(sample_size = 10000, n.samples = 8, n.barcodes = 737280){
  samples = sapply(1:n.samples, function(x) sample(x = n.barcodes, size = sample_size, replace = FALSE))
  
  dups = samples[base::duplicated(as.numeric(samples))]
  
  removal_rate = sum(samples%in%dups)/length(samples)
  
  return(removal_rate)
  
}


sizes = c(1000, 2000, 4000, 8000, 12000, 24000)
nsim = 100

sim_results = lapply(sizes, function(x) sapply(1:nsim, function(y) do_sim_remove(sample_size = x, n.samples = 8)))

size_column = factor(rep(sizes, nsim), levels = sizes, ordered = TRUE)

plot_df = data.frame(rate = unlist(sim_results),
                     size = size_column[order(size_column)])

ggplot(plot_df, mapping = aes(x = size, y = rate)) +
  geom_boxplot() +
  labs(x= "Cells per sample", y = "Removal rate") +
  theme_bw() +
  scale_y_continuous(breaks = seq(0, 0.2, 0.05), labels = paste0(seq(0, 20, 5), "%"))
Removal rate for experiments of different sizes. If duplicate cell barcodes are removed, there is a loss of power via cell exclusion. Plotted is the expected rate of removal for experiments of different sizes: 8 multiplexed samples, each of the size given on the x-axis are considered. These estimates assume no swapping.

Figure 35: Removal rate for experiments of different sizes
If duplicate cell barcodes are removed, there is a loss of power via cell exclusion. Plotted is the expected rate of removal for experiments of different sizes: 8 multiplexed samples, each of the size given on the x-axis are considered. These estimates assume no swapping.

5.2 Molecule exclusion

Removal of cells, however, is a wasteful solution, particularly for larger samples. Because it is cDNA molecules that swap on the sequencing machines, operating at the molecule level offers a more precise solution.

For example, in a 10X Genomics single-cell experiment, transcripts are generated that contain:

  • An Illumina sample barcode
  • A cell barcode, drawn from a pool of ~737,280
  • A unique molecular identifier (UMI): a 10bp random sequence, providing 1,048,576 combinations
  • Gene sequence from the captured mRNA.

The chance of observing a read in two different samples with the same cell barcode, UMI, and gene alignment is therefore extremely low, due to the large amount of combinatorial complexity generated by this combination.

We now consider the mouse epithelial cells sequenced on HiSeq 4000 (HiSeq 4000 Dataset 2, above). Specifically, we have identified molecules that showed shared UMI, cell barcode, and aligned gene between different samples. From these molecules, we calculated the fraction of all reads (i.e. observed across all samples) that were observed in each sample. Plotted in Figure 36 is the largest observed read fraction for each of the molecules observed in more than one sample.

h5_files = dir(paste0(folder_location, "/data/molecule_info/hiseq_4000"), full.names = TRUE)

cleaned_4000 = swappedDrops(h5_files, get.diagnostics = TRUE, get.swapped = TRUE)
reads_4000 = cleaned_4000$diagnostics

large_frac = apply(reads_4000, 1, max)/Matrix::rowSums(reads_4000)
large_frac = large_frac[large_frac!=1]


ggplot(mapping = aes(x = large_frac)) +
  geom_density() +
  geom_vline(xintercept = c(1/2, 2/3, 3/4, 4/5)) +
  geom_label(data = data.frame(lab = c("1/2", "2/3", "3/4"), 
                               X = c(1/2, 2/3, 3/4),
                               Y = rep(6, 3)),
             mapping = aes(x = X, y=Y, label = lab),
             nudge_x = 0.0) +
  theme_bw() +
  labs(x = "Largest read fraction", y = "Density")
Density of the largest read fraction for swapped molecules. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

Figure 36: Density of the largest read fraction for swapped molecules
The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

There are several things to notice about this plot

  • There is periodic density at particular values. Three of these are annotated - a largest read fraction of 0.5 may indicate molecules with one read in each of two samples; a value of 0.66 indicates molecules with two reads in one sample, and one in another, and so on.

  • The largest amount of density is present for values close to 1. This indicates molecules where the vast majority of their reads are allocated to a single sample. This is the sort of read distribution that we would expect to see after barcode swapping. Remember that, after UMI decomposition, each sample would be allocated a single gene count despite the larger number of reads from one sample.

To further illustrate the high density at large read fractions, a cumulative distribution function is shown in Figure 37.

ggplot(mapping = aes(x = large_frac)) +
  stat_ecdf() +
  theme_bw() +
  labs(x = "Largest read fraction", y = "Cumulative distribution")
Cumulative distribution of the largest read fraction for swapped molecules. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

Figure 37: Cumulative distribution of the largest read fraction for swapped molecules
The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

Note that well over 80% of swapped molecules show a largest read fraction of above 0.8, and over 60% above 0.9.

The same libraries were re-sequenced on HiSeq 2500. Figures 38 and 39 overlay the HiSeq 2500 data over the HiSeq 4000 data, showing a reduced level of swapping. Table 1 presents summary values between the two sets of data.

h5_files_2500 = dir(paste0(folder_location, "/data/molecule_info/hiseq_2500"), full.names = TRUE)

cleaned_2500 = swappedDrops(h5_files_2500, get.diagnostics = TRUE, get.swapped = TRUE)
reads_2500 = cleaned_2500$diagnostics

tab = data.frame(row.names = c("Hiseq4000", "HiSeq2500"), 
                 Reads = format(c(sum(reads_4000), sum(reads_2500)), big.mark = ","),
                 Molecules = format(c(nrow(reads_4000), nrow(reads_2500)), big.mark = ","),
                 "Swapped fraction" = format(c(sum(Matrix::rowSums(reads_4000>0)>1)/nrow(reads_4000),
                                             sum(Matrix::rowSums(reads_2500>0)>1)/nrow(reads_4000)),
                                             digits = 3))

kable(tab, caption = "Library summary statistics. Swapped fraction is the fraction of molecules (identical UMI, cell barcode, aligned gene) observed in more than one sample.", col.names = c("Reads", "Molecules", "Swapped fraction"))
Table 1: Library summary statistics
Swapped fraction is the fraction of molecules (identical UMI, cell barcode, aligned gene) observed in more than one sample.
Reads Molecules Swapped fraction
Hiseq4000 626,168,518 53,636,695 0.071033
HiSeq2500 268,768,037 44,062,616 0.000487
large_frac_2500 = apply(reads_2500, 1, max)/Matrix::rowSums(reads_2500)
large_frac_2500 = large_frac_2500[large_frac_2500!=1]

ggplot() +
  geom_density(mapping = aes(x = large_frac), col = "black") +
  geom_density(mapping = aes(x = large_frac_2500), col = "red") +
  theme_bw() +
  labs(x = "Largest read fraction", y = "Density")
Density of the largest read fraction for swapped molecules. HiSeq 2500 is shown in red, HiSeq 4000 in black. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

Figure 38: Density of the largest read fraction for swapped molecules
HiSeq 2500 is shown in red, HiSeq 4000 in black. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

ggplot() +
  stat_ecdf(mapping = aes(x = large_frac), col = "black") +
  stat_ecdf(mapping = aes(x = large_frac_2500), col = "red") +
  theme_bw() +
  labs(x = "Largest read fraction", y = "Cumulative distribution")
Cumulative distribution of the largest read fraction for swapped molecules. HiSeq 2500 is shown in red, HiSeq 4000 in black. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.

Figure 39: Cumulative distribution of the largest read fraction for swapped molecules
HiSeq 2500 is shown in red, HiSeq 4000 in black. The read fraction is the fraction of all reads for a molecule (cell barcode, UMI, aligned gene identical) that were observed in each sample. The largest fraction is simply the fraction in the sample that contained the most reads for each molecule. Only molecules that were observed in more than one sample are shown.


We have implemented an algorithm to exclude molecules that were swapped between single-cell 10X Genomics libraries in the DropletUtils package, currently available on GitHub.

We:

  • Identify molecules that are present in two or more different samples, yet contain the same cell barcode, UMI, and aligned gene

  • Calculate the fraction of all reads observed for these molecules that was observed in each of the samples in which it was detected

  • If the reads derive mostly from a single sample (largest read fraction \(\geq\) 0.8), we assume that the reads detected in other samples swapped from this sample-of-origin. We exclude the molecule count from all samples other than the sample-of-origin.

  • If the reads are relatively evenly spread across samples (largest read fraction \(<\) 0.8), we cannot reliably identify the sample from which the molecule originated. We therefore remove the molecule count from all samples.

After running the method, we exclude 0.715% of molecules in the HiSeq 4000 data.

To test the effect of molecule exclusion on the data, we call cells from the processed and raw count matrices. If we are successfully removing swapping from the data, we should remove the artefactual cells that we identified in Samples B1 and B2 above (see section 4.4, Barcode swapping can create “shadow-cells” in compromised samples). Specifically, we call cells using emptyDrops (part of the DropletUtils package), specifiying an FDR-corrected \(q<0.01\) and a minimum library size of 1000 molecules. This method identifies a background RNA distribution in the data, and scores cell presence based on transcriptome divergence from the background. The results of cell calling are shown in Table 2, and illustrated graphically in Figure 40

call_cells = function(expr.matrix){

  test = emptyDrops(expr.matrix, test.args = list(niters = 30000))
  
  test = test[!is.na(test$FDR),]
  
  sig = sum(test$FDR <= 0.01 & test$Total >= 1000)
  
  return(sig)
  
}

sample_map = unique(samp2_all$SampleID)
names(sample_map) = c("A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1")



tab = data.frame(sample = sample_map, 
                 cells_pre = sapply(seq_along(cleaned_4000$cleaned), 
                                    function(i) call_cells(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]])),
                 cells_post = sapply(cleaned_4000$cleaned, call_cells))

cellcall = tab

kable(tab, row.names = FALSE, col.names = c("Sample", "All molecules", "Unswapped molecules"), caption = "Number of cells called before and after swapped molecule processing.")
Table 2: Number of cells called before and after swapped molecule processing
Sample All molecules Unswapped molecules
C1 659 635
A2 483 456
D2 161 143
A1 347 344
B2 279 14
B1 309 12
C2 1214 1203
D1 581 534
ggplot(melt(tab), aes(x = sample, fill = variable, y = value)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2", name = "", label = c("Raw counts", "Unswapped\ncounts")) +
  labs(x = "Sample", y = "Number of called cells")
Number of cells called in each sample before and after swapped molecule removal. Note that the number of called cells is different to those recorded earlier in this document, as this figure considers cells called by emptyDrops, and the earlier figure by CellRanger.

Figure 40: Number of cells called in each sample before and after swapped molecule removal
Note that the number of called cells is different to those recorded earlier in this document, as this figure considers cells called by emptyDrops, and the earlier figure by CellRanger.

n_change = sum(tab$cells_pre[!grepl("B", tab$sample)] - tab$cells_post[!grepl("B", tab$sample)])
n_healthy = sum(tab$cells_pre[!grepl("B", tab$sample)])

The method has effectively removed the artefactual cells without considerably affecting cells called in other samples, as would have been the case with the cell-exclusion method. This suggests that, once swapped molecules are removed, the barcode libraries return to their background-like appearance. This in turn suggests that our unswapping approach is working as desired.

Interestingly, the method has also removed 130 cells from the “healthy” samples (3.77% of pre-correction cells), suggesting that swapping is also creating artefacts in conventional libraries at a low rate.

Note that when calling like CellRanger does, We are vulnerable to the lack of input whether or not we have removed swapped transcripts. This is shown in Figure 41. We therefore recommend emptyDrops, which appears to be more robust.

tab = data.frame(sample = sample_map, 
                 cells_pre = sapply(seq_along(cleaned_4000$cleaned), 
                                    function(i) sum(defaultDrops(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]]))),
                 cells_post = sapply(cleaned_4000$cleaned, 
                                     function(x) sum(defaultDrops(x))))


# kable(tab, row.names = FALSE, col.names = c("Sample", "All molecules", "Unswapped molecules"), caption = "Number of cells called before and after swapped molecule processing.")

ggplot(melt(tab), aes(x = sample, fill = variable, y = value)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2", name = "", label = c("Raw counts", "Unswapped\ncounts")) +
  labs(x = "Sample", y = "Number of called cells")
Number of cells called in each sample before and after swapped molecule removal. Cell calling was performed with an algorithm like CellRanger's. The results demonstrate that CellRanger's cell calling algorithm can artefactual results whether or not swapped transcripts are removed.

Figure 41: Number of cells called in each sample before and after swapped molecule removal
Cell calling was performed with an algorithm like CellRanger’s. The results demonstrate that CellRanger’s cell calling algorithm can artefactual results whether or not swapped transcripts are removed.

Success in removing the effects of swapping from this dataset demonstrates the efficacy of the method for removing swapping artefacts in 10X data.

Finally, we run the method on two different datasets, together. Because they were processed and sequenced entirely separately, we should see no swapping and therefore minimal molecule removal between the two experiments. We use the HiSeq 2500 dataset described above, and a complete replicate of the same experiment that was processed and sequenced (again on HiSeq 2500) at a later date.

A summary of the two experiments is shown in Table 3

new_files = c(h5_files_2500,
              dir(paste0(folder_location, "/data/molecule_info/hiseq_2500_run2"), full.names = TRUE))

new_reads = swappedDrops(new_files, get.diagnostics = TRUE)$diagnostics

read_present = new_reads > 0
present_expt1 = read_present[,1:length(h5_files_2500)] 
present_expt2 = read_present[,-(1:length(h5_files_2500))]

n_mol = nrow(new_reads)
n_mol_expt1 = sum(Matrix::rowSums(present_expt1)>0)
n_mol_expt2 = sum(Matrix::rowSums(present_expt2)>0)
swapped_expt1 = sum(Matrix::rowSums(present_expt1)>1)
swapped_expt2 = sum(Matrix::rowSums(present_expt2)>1)
swapped_both = Matrix::rowSums(present_expt1) > 0 & Matrix::rowSums(present_expt2) > 0

df = data.frame("Metric" = c("Number of molecules", "Number of swapped molecules", "Swapped fraction"),
                "Experiment 1" = c(n_mol_expt1, swapped_expt1, format(swapped_expt1/n_mol_expt1, digits = 3)),
                "Experiment 2" = c(n_mol_expt2, swapped_expt2, format(swapped_expt2/n_mol_expt2, digits = 3)))

kable(df, caption = "Summary of the two different experiments.", row.names = FALSE)
Table 3: Summary of the two different experiments
Metric Experiment.1 Experiment.2
Number of molecules 44062616 241978544
Number of swapped molecules 26105 46211
Swapped fraction 0.000592 0.000191

Between the two datasets, 688 molecules were seen in both datasets (0.000241% of all observed molecules, approximately two orders of magnitude less than the number of molecules observed to swap within samples on HiSeq 2500).

6 Manuscript figure code

Below is the code used to generate the figures used in the manuscript.

# 
# rate = ggplot(df_4000, aes(x = tot, y = mapped)) +
#     geom_smooth(method = "lm", se = FALSE, col = "black", alpha = 0.3, lwd = 1.5) +
#   geom_point(col  = "cornflowerblue") +
#   # scale_x_log10() + scale_y_log10() +
#   labs(x = "Available swapping reads", y = "Observed swapped reads") +
#   theme_bw() +
#     scale_x_continuous(breaks = c(5e6, 1e7, 1.5e7), labels = c("5,000,000", "10,000,000", "15,000,000")) +
#   scale_y_continuous(breaks = seq(2500, 10000, 2500), labels = c("2,500", "5,000", "7,500", "10,000"))+
#   theme(axis.text = element_text(face = "bold", size = 15, colour = "black"),
#        axis.title.y = element_text(size = 20, face = "bold"),
#        axis.title.x = element_text(size = 20, face = "bold"))
# 
# print(rate)
# 
# 
# 

# bc_conc
#colourful but worse
# ggplot(conc_df, aes (x = bc_mol, y= seq_mol, col = swap_rate, size = swap_rate)) +
#   geom_point() +
#   scale_color_viridis(name = "Swap Rate") +
#   theme_bw() +
#   scale_size_continuous(name = "Swap Rate") +
#   labs(x = "Barcode molarity (pmol/L)", y = "Sequenced cDNA molarity (pmol/L)")


model = lm(conc_df$swap_rate ~ conc_df$bc_mol)
print(summary(model))
## 
## Call:
## lm(formula = conc_df$swap_rate ~ conc_df$bc_mol)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.027762 -0.014322  0.000392  0.010743  0.035275 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.744e-03  2.101e-02   0.464    0.650
## conc_df$bc_mol 4.220e-06  5.161e-06   0.818    0.427
## 
## Residual standard error: 0.01797 on 14 degrees of freedom
## Multiple R-squared:  0.04559,    Adjusted R-squared:  -0.02258 
## F-statistic: 0.6688 on 1 and 14 DF,  p-value: 0.4272
#*1.1 for the 2500 rate

conc = ggplot(conc_df, aes(x = bc_mol, y= swap_rate * 1.1)) +
  geom_smooth(method = "lm", fullrange = TRUE, col = "darkgrey", fill = "grey80") +
  geom_point() + 
  annotate("text", x= 2800, y = 0.055, label = paste0("italic(R)^2 == ", format(summary(model)$r.squared, digits = 2)), parse = TRUE, fontface = "bold", size = 7) +
  theme_bw() +
  scale_x_continuous(breaks = seq(2000, 6000, 2000), limits = c(2000, 6000), labels = seq(2, 6, 2), name = "Free barcode concentration (nM)") +
  scale_y_continuous(name = "Estimated swapping rate", breaks = c(seq(0, 0.06, 0.02)), labels = paste0(seq(0,6,2), "%"), limits = c(-0.01, 0.07)) +
  theme(panel.grid = element_blank(),
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(face = "bold", size = 10, colour = "black"))

print(conc)

#get library size
get.libsizes = function(expr.matrix){
  
  test = emptyDrops(expr.matrix, test.args = list(niters = 30000))
  
  test = test[!is.na(test$FDR),]
  
  sig = test$FDR <= 0.01 & test$Total >= 1000
  
  return(test$Total[sig])
  
}

libs = lapply(seq_along(cleaned_4000$cleaned), 
              function(i) get.libsizes(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]]))

libs = lapply(seq_along(libs), function(x) data.frame(libs = libs[[x]], sample = sample_map[x]))
## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded

## Warning in data.frame(libs = libs[[x]], sample = sample_map[x]): row names
## were found from a short variable and have been discarded
libs = do.call(rbind, libs)



#libsize
libsize = ggplot(libs, aes(x = factor(sample, levels = sample_map[order(sample_map)]), y = libs, fill = substr(sample, 1, 1))) +
  geom_violin() +
  scale_y_log10(labels = c("1,000", "10,000", "100,000"), breaks = 10^(3:5)) +
  theme_bw() +
  theme(legend.position = "none", 
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        panel.grid.minor.y = element_blank(), 
        axis.text.y = element_text(size = 10, face = "bold", colour = "black"), 
        axis.text.x = element_text(size = 12, face = "bold", colour = "black"), 
        axis.title = element_text(face = "bold", size = 15)) +
  labs(x = "Sample", y = "# UMIs") +
  scale_fill_brewer(palette = "Set2")


print(libsize)

cells_df = melt(cellcall)
## Using sample as id variables
#CELL CALLING 
cells = ggplot(cells_df, aes(x = sample, fill = variable, y = value)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw() +
  scale_fill_manual(values = c("#665178", "#A9CDC3"), name = "", label = c("All molecules", "Swapped molecules\nremoved")) +
  labs(x = "Sample", y = "# cells called") +
  theme(legend.position = c(0.3, 0.8),
        legend.title = element_blank(),
        legend.background = element_rect(fill=alpha('white', 1), colour = "black"),
        legend.text = element_text(size = 10, colour = "black", face = "bold"),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        panel.grid.minor.y = element_blank(), 
        axis.text.y = element_text(size = 10, face = "bold", colour = "black"), 
        axis.text.x = element_text(size = 12, face = "bold", colour = "black"), 
        axis.title = element_text(face = "bold", size = 15)) +
  scale_y_continuous(breaks = c(0,500, 1000), labels = c("0", "500", "1,000"))


print(cells)

swap_fix = plot_grid(libsize, cells, align = "hv", ncol = 1, labels = c("C", "D"))


master_plot = plot_grid(conc, NULL, swap_fix, nrow = 2, labels = c("A", "B"))

save_plot(plot = master_plot, filename = paste0(folder_location, "/figs/cowplot_out.pdf"), useDingbats = FALSE, base_width = 15, base_height = 5)

save(conc_df, libs, cells_df, sample_map, df_4000, file = paste0(folder_location, "/figs/plot_dfs.RData"))

References

Illumina. 2017. “Effects of Index Misassignment on Multiplexing and Downstream Analysis.” [White Paper]. Accessed June 21. https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf?linkId=36607862.

Marioni, John C., Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. 2008. “RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18 (9): 1509–17. doi:10.1101/gr.079558.108.

Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single Cell Resolution Map of Mouse Haematopoietic Stem and Progenitor Cell Differentiation.” Blood, January, blood–2016–05–716480. doi:10.1182/blood-2016-05-716480.

Picelli, Simone, Omid R. Faridani, Asa K. Bjorklund, Gosta Winberg, Sven Sagasser, and Rickard Sandberg. 2014. “Full-Length RNA-Seq from Single Cells Using Smart-Seq2.” Nature Protocols 9 (1): 171–81. doi:10.1038/nprot.2014.006.

Sinha, Rahul, Geoff Stanley, Gunsagar Singh Gulati, Camille Ezran, Kyle Joseph Travaglini, Eric Wei, Charles Kwok Fai Chan, et al. 2017. “Index Switching Causes ‘Spreading-of-Signal’ Among Multiplexed Samples in Illumina HiSeq 4000 DNA Sequencing.” bioRxiv, April. doi:10.1101/125724.

Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (January): 14049. http://dx.doi.org/10.1038/ncomms14049.